Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value.

Rusty Bargain is interested in:

  • the quality of the prediction;
  • the speed of the prediction;
  • the time required for training

Target = price

Environment Setup & Required Libraries¶

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from  catboost import CatBoostRegressor
from xgboost import XGBRegressor
import lightgbm as lgb


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import time
import gc

Data preparation¶

In [2]:
df = pd.read_csv("/datasets/car_data.csv")
display(df)
DateCrawled Price VehicleType RegistrationYear Gearbox Power Model Mileage RegistrationMonth FuelType Brand NotRepaired DateCreated NumberOfPictures PostalCode LastSeen
0 24/03/2016 11:52 480 NaN 1993 manual 0 golf 150000 0 petrol volkswagen NaN 24/03/2016 00:00 0 70435 07/04/2016 03:16
1 24/03/2016 10:58 18300 coupe 2011 manual 190 NaN 125000 5 gasoline audi yes 24/03/2016 00:00 0 66954 07/04/2016 01:46
2 14/03/2016 12:52 9800 suv 2004 auto 163 grand 125000 8 gasoline jeep NaN 14/03/2016 00:00 0 90480 05/04/2016 12:47
3 17/03/2016 16:54 1500 small 2001 manual 75 golf 150000 6 petrol volkswagen no 17/03/2016 00:00 0 91074 17/03/2016 17:40
4 31/03/2016 17:25 3600 small 2008 manual 69 fabia 90000 7 gasoline skoda no 31/03/2016 00:00 0 60437 06/04/2016 10:17
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
354364 21/03/2016 09:50 0 NaN 2005 manual 0 colt 150000 7 petrol mitsubishi yes 21/03/2016 00:00 0 2694 21/03/2016 10:42
354365 14/03/2016 17:48 2200 NaN 2005 NaN 0 NaN 20000 1 NaN sonstige_autos NaN 14/03/2016 00:00 0 39576 06/04/2016 00:46
354366 05/03/2016 19:56 1199 convertible 2000 auto 101 fortwo 125000 3 petrol smart no 05/03/2016 00:00 0 26135 11/03/2016 18:17
354367 19/03/2016 18:57 9200 bus 1996 manual 102 transporter 150000 3 gasoline volkswagen no 19/03/2016 00:00 0 87439 07/04/2016 07:15
354368 20/03/2016 19:41 3400 wagon 2002 manual 100 golf 150000 6 gasoline volkswagen NaN 20/03/2016 00:00 0 40764 24/03/2016 12:45

354369 rows × 16 columns

In [3]:
# Inspect dataset
df1 = df.copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB

Standardize Columns¶

In [4]:
df.columns = df.columns.str.lower()
display(df.head())
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
0 24/03/2016 11:52 480 NaN 1993 manual 0 golf 150000 0 petrol volkswagen NaN 24/03/2016 00:00 0 70435 07/04/2016 03:16
1 24/03/2016 10:58 18300 coupe 2011 manual 190 NaN 125000 5 gasoline audi yes 24/03/2016 00:00 0 66954 07/04/2016 01:46
2 14/03/2016 12:52 9800 suv 2004 auto 163 grand 125000 8 gasoline jeep NaN 14/03/2016 00:00 0 90480 05/04/2016 12:47
3 17/03/2016 16:54 1500 small 2001 manual 75 golf 150000 6 petrol volkswagen no 17/03/2016 00:00 0 91074 17/03/2016 17:40
4 31/03/2016 17:25 3600 small 2008 manual 69 fabia 90000 7 gasoline skoda no 31/03/2016 00:00 0 60437 06/04/2016 10:17
In [5]:
df.loc[251638,['model']] = 'wrangler'

Details to Help with Data Cleaning¶

General

  • 1769: Steam Wagon (Nicolas-Joseph Cugnot, France) Steam-powered, heavy, experimental — not practical
  • 1800s: Steam carriages - Small numbers in UK & France, for private roads
  • 1830s–1890s: Electric vehicles - Short-range city vehicles, mostly experimental or low-volume
  • The first gasoline car was made as early as 1885
  • The first car to receive registration was on August 14th, 1893

Automation of Vehicle History

  • 1904: Sturtevant Automatic Automobile
  • 1939/1940: Cadillac & Oldsmobile w/ Hydra-Matic by General Motors
  • 1941: Buick (military - WWII civilian car production halt (1942)) - Chrysler Fluid Drive / Vacamatic / Prestomatic
  • 1948: Buick Roadmaster / Dynaflow (1949)
  • 1950: Powerglide by Chevrolet
  • 1961: K4A Mercedes-Benz
    • most Cadillac, Oldsmobile, Buick, and Chrysler
  • 1962 - : Automatics rapidly expanded

First Car by Model (Earliest Registration Year)

  • Rover: 1885
  • Mercedes-Benz: 1886
  • Peugeot: 1889
  • Opel: 1899
  • Renault: 1899
  • Fiat: 1899
  • Ford: 1903
  • Škoda: 1905
  • Lancia: 1906
  • Daihatsu: 1907
  • Suzuki: 1909
  • Audi: 1910
  • Alfa Romeo: 1910
  • Chevrolet: 1911
  • Mitsubishi: 1917
  • Citroën: 1919
  • Jaguar: 1922
  • Chrysler: 1924
  • Volvo: 1927
  • BMW: 1928
  • Mazda: 1931
  • Porsche: 1931
  • Nissan: 1933
  • Toyota: 1936
  • Volkswagen: 1937
  • Jeep: 1941
  • Kia: 1944
  • Saab: 1947
  • Honda: 1948
  • Land Rover: 1948
  • SEAT: 1950
  • Subaru: 1954
  • Trabant: 1957
  • Mini: 1959
  • Dacia: 1966
  • Daewoo: 1967
  • Hyundai: 1967
  • Lada: 1970
  • Smart: 1998
  • Sonstige_autos: N/A (miscellaneous)
In [6]:
# Look at years before 1885 and after 2025
df[(df["registrationyear"] > 2025) | (df["registrationyear"] < 1885)]
Out[6]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
622 16/03/2016 16:55 0 NaN 1111 NaN 0 NaN 5000 0 NaN opel NaN 16/03/2016 00:00 0 44628 20/03/2016 16:44
12946 29/03/2016 18:39 49 NaN 5000 NaN 0 golf 5000 12 NaN volkswagen NaN 29/03/2016 00:00 0 74523 06/04/2016 04:16
15147 14/03/2016 00:52 0 NaN 9999 NaN 0 NaN 10000 0 NaN sonstige_autos NaN 13/03/2016 00:00 0 32689 21/03/2016 23:46
15870 02/04/2016 11:55 1700 NaN 3200 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 02/04/2016 00:00 0 33649 06/04/2016 09:46
16062 29/03/2016 23:42 190 NaN 1000 NaN 0 mondeo 5000 0 NaN ford NaN 29/03/2016 00:00 0 47166 06/04/2016 10:44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340548 02/04/2016 17:44 0 NaN 3500 manual 75 NaN 5000 3 petrol sonstige_autos NaN 02/04/2016 00:00 0 96465 04/04/2016 15:17
340759 04/04/2016 23:55 700 NaN 1600 manual 1600 a3 150000 4 petrol audi no 04/04/2016 00:00 0 86343 05/04/2016 06:44
341791 28/03/2016 17:37 1 NaN 3000 NaN 0 zafira 5000 0 NaN opel NaN 28/03/2016 00:00 0 26624 02/04/2016 22:17
348830 22/03/2016 00:38 1 NaN 1000 NaN 1000 NaN 150000 0 NaN sonstige_autos NaN 21/03/2016 00:00 0 41472 05/04/2016 14:18
351682 12/03/2016 00:57 11500 NaN 1800 NaN 16 other 5000 6 petrol fiat NaN 11/03/2016 00:00 0 16515 05/04/2016 19:47

171 rows × 16 columns

In [7]:
# First registration reported in 1885; registration dates before this are incorrect
car_dates = df[(df["registrationyear"] > 2025) & (df['model'].isna()) | 
    (df["registrationyear"] < 1885) & (df['model'].isna())]
car_dates

# Incorrect registration dates need to be marked as Nan
df.loc[(df["registrationyear"] > 2025) & (df['model'].isna()) | (df["registrationyear"] < 1885) & 
    (df['model'].isna()),["registrationyear"]] = np.nan

First Car by Model (Earliest Registration Year)

  • Rover: 1885
  • Mercedes-Benz: 1886
  • Peugeot: 1889
  • Opel: 1899
  • Renault: 1899
  • Fiat: 1899
  • Ford: 1903
  • Škoda: 1905
  • Lancia: 1906
  • Daihatsu: 1907
  • Suzuki: 1909
  • Audi: 1910
  • Alfa Romeo: 1910
  • Chevrolet: 1911
  • Mitsubishi: 1917
  • Citroën: 1919
  • Jaguar: 1922
  • Chrysler: 1924
  • Volvo: 1927
  • BMW: 1928
  • Mazda: 1931
  • Porsche: 1931
  • Nissan: 1933
  • Toyota: 1936
  • Volkswagen: 1937
  • Jeep: 1941
  • Kia: 1944
  • Saab: 1947
  • Honda: 1948
  • Land Rover: 1948
  • SEAT: 1950
  • Subaru: 1954
  • Trabant: 1957
  • Mini: 1959
  • Dacia: 1966
  • Daewoo: 1967
  • Hyundai: 1967
  • Lada: 1970
  • Smart: 1998
  • Sonstige_autos: N/A (miscellaneous)

Brands that do not have registration dates before earliest record

  • Lada
  • Daewoo
  • Dacia
  • Mini
  • SEAT
  • Land Rover
  • Honda
  • Saab
  • Kia
  • Nissan
  • Porsche
  • Mazda
  • Jaguar
  • Chrysler
  • Volvo
  • Rover
  • Mercedes-Benz
  • Peugeot
  • Opel
  • Renault
  • Fiat
  • Ford
  • Škoda
  • Lancia
  • Daihatsu
  • Suzuki
  • Audi
  • Alfa Romeo
  • Chevrolet
In [8]:
# Look at smart cars registered before 1998
df[(df['brand'] == 'smart') & (df['registrationyear'] < 1998)]

smartnan = (df['brand'] == 'smart') & (df['registrationyear'] < 1998) & (df['model'].isna())

df.loc[smartnan,['registrationyear']] = np.nan

# Hyundai before 1967 implausible
hyundai = (df['brand'] == 'hyundai') & (df['registrationyear'] < 1967) & (df['model'].isna())
df.loc[hyundai, ['registrationyear']] = np.nan


remaining = ['smart', 'hyundai', 'mitsubishi', 'citroen', 'bmw', 'toyota', 'volkswagen', 'jeep', 'subaru', 'trabant']
earliest_years = {'smart': 1998, 'hyundai': 1967,'mitsubishi': 1917, 'citroen': 1919, 'bmw': 1928, 'toyota': 1936, 
                  'volkswagen': 1937, 'jeep': 1941, 'subaru': 1954, 'trabant': 1957}

for brands in remaining:
    df.loc[(df['brand'] == brands) & (df['registrationyear'] < earliest_years[brands]) & 
        df['model'].isna(), ['registrationyear']] = np.nan
    display(df[(df['brand'] == brands) & (df['registrationyear'] < earliest_years[brands])])
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
31212 12/03/2016 16:45 700 small 1997.0 NaN 0 forfour 5000 3 petrol smart NaN 12/03/2016 00:00 0 88416 07/04/2016 06:17
161667 04/04/2016 20:56 1650 small 1992.0 auto 55 fortwo 100000 7 petrol smart no 04/04/2016 00:00 0 28327 06/04/2016 23:44
319739 05/04/2016 20:36 1650 small 1992.0 NaN 0 fortwo 100000 6 NaN smart yes 05/04/2016 00:00 0 28327 05/04/2016 20:36
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
244840 09/03/2016 17:50 0 NaN 1910.0 NaN 0 other 5000 0 NaN hyundai NaN 09/03/2016 00:00 0 59510 07/04/2016 10:44
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
154559 03/04/2016 12:40 0 small 1910.0 manual 0 colt 150000 0 petrol mitsubishi NaN 03/04/2016 00:00 0 46397 07/04/2016 14:57
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
125577 15/03/2016 18:38 7750 NaN 1001.0 NaN 0 other 5000 0 NaN citroen NaN 15/03/2016 00:00 0 66706 06/04/2016 18:47
270911 23/03/2016 11:48 0 other 1910.0 manual 0 other 5000 0 petrol citroen no 23/03/2016 00:00 0 98630 23/03/2016 11:48
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
58883 15/03/2016 21:57 1 NaN 1910.0 NaN 0 3er 150000 0 NaN bmw NaN 15/03/2016 00:00 0 74074 07/04/2016 07:17
119442 18/03/2016 10:37 1 NaN 1000.0 NaN 1000 3er 5000 0 NaN bmw NaN 18/03/2016 00:00 0 94086 05/04/2016 22:16
203230 01/04/2016 15:37 400 NaN 1910.0 manual 170 3er 5000 0 NaN bmw NaN 01/04/2016 00:00 0 66333 03/04/2016 11:48
213499 08/03/2016 12:06 380 NaN 1000.0 NaN 0 6er 5000 0 NaN bmw NaN 08/03/2016 00:00 0 35102 06/04/2016 00:16
287304 09/03/2016 15:54 500 NaN 1602.0 manual 0 other 5000 0 NaN bmw yes 09/03/2016 00:00 0 30900 10/03/2016 12:17
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
23750 16/03/2016 19:58 3900 wagon 1910.0 manual 90 passat 150000 0 petrol volkswagen NaN 16/03/2016 00:00 0 88662 07/04/2016 05:45
35943 19/03/2016 10:57 200 other 1910.0 NaN 0 caddy 150000 0 gasoline volkswagen NaN 19/03/2016 00:00 0 35096 20/03/2016 18:10
40133 23/03/2016 18:00 0 NaN 1910.0 NaN 0 other 5000 0 NaN volkswagen NaN 23/03/2016 00:00 0 85045 23/03/2016 18:41
53577 20/03/2016 11:44 330 NaN 1000.0 NaN 0 polo 5000 0 NaN volkswagen NaN 20/03/2016 00:00 0 45259 04/04/2016 08:17
56241 30/03/2016 18:54 950 NaN 1400.0 manual 1400 golf 125000 4 petrol volkswagen NaN 30/03/2016 00:00 0 50389 03/04/2016 09:45
62803 07/03/2016 22:58 3400 small 1910.0 manual 90 beetle 90000 4 NaN volkswagen no 07/03/2016 00:00 0 34308 12/03/2016 08:16
71459 27/03/2016 23:46 500 NaN 1000.0 NaN 0 golf 5000 0 NaN volkswagen NaN 27/03/2016 00:00 0 91628 29/03/2016 13:46
74814 21/03/2016 12:52 400 NaN 1910.0 NaN 60 golf 150000 0 petrol volkswagen NaN 21/03/2016 00:00 0 29462 25/03/2016 09:17
143621 17/03/2016 23:40 550 NaN 1000.0 NaN 1000 golf 5000 6 petrol volkswagen NaN 17/03/2016 00:00 0 91732 26/03/2016 05:18
144388 09/03/2016 20:52 50 NaN 1910.0 NaN 0 kaefer 5000 0 NaN volkswagen NaN 09/03/2016 00:00 0 50374 05/04/2016 18:46
147663 03/04/2016 19:37 0 NaN 1910.0 NaN 0 polo 5000 0 NaN volkswagen NaN 03/04/2016 00:00 0 2826 05/04/2016 20:15
151280 05/04/2016 00:39 300 NaN 1910.0 manual 0 golf 150000 0 petrol volkswagen NaN 04/04/2016 00:00 0 89269 05/04/2016 05:42
164397 29/03/2016 17:49 0 NaN 1000.0 NaN 0 transporter 5000 1 NaN volkswagen NaN 29/03/2016 00:00 0 29351 06/04/2016 03:45
174893 05/03/2016 19:48 0 NaN 1000.0 NaN 1000 golf 5000 4 petrol volkswagen NaN 05/03/2016 00:00 0 35716 05/03/2016 22:27
183727 03/04/2016 12:48 0 bus 1910.0 NaN 0 transporter 5000 0 NaN volkswagen NaN 03/04/2016 00:00 0 84478 03/04/2016 12:48
189722 29/03/2016 16:56 1500 NaN 1000.0 manual 0 kaefer 5000 0 petrol volkswagen NaN 29/03/2016 00:00 0 48324 31/03/2016 10:15
203985 07/03/2016 14:53 222 NaN 1910.0 manual 0 golf 5000 0 petrol volkswagen NaN 07/03/2016 00:00 0 26802 12/03/2016 04:15
218241 16/03/2016 12:46 7999 NaN 1800.0 NaN 290 golf 5000 6 NaN volkswagen NaN 16/03/2016 00:00 0 15827 29/03/2016 20:47
256532 05/03/2016 17:44 12500 NaN 1000.0 NaN 200 golf 5000 0 NaN volkswagen NaN 28/02/2016 00:00 0 75378 07/04/2016 12:17
276318 31/03/2016 14:58 300 NaN 1910.0 NaN 0 polo 5000 0 NaN volkswagen NaN 31/03/2016 00:00 0 53902 06/04/2016 08:16
286928 18/03/2016 16:51 1 NaN 1000.0 NaN 174 touareg 5000 3 gasoline volkswagen NaN 18/03/2016 00:00 0 97616 05/04/2016 22:44
318111 25/03/2016 13:42 1 NaN 1910.0 NaN 0 golf 125000 0 NaN volkswagen NaN 25/03/2016 00:00 0 54295 06/04/2016 15:44
318501 02/04/2016 13:57 0 NaN 1910.0 NaN 0 caddy 5000 0 NaN volkswagen NaN 02/04/2016 00:00 0 16949 06/04/2016 12:16
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
18224 09/03/2016 17:49 7999 NaN 1500.0 manual 224 impreza 5000 3 NaN subaru NaN 09/03/2016 00:00 0 53577 15/03/2016 05:15
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen
199563 09/03/2016 20:37 60 wagon 1956.0 NaN 0 other 150000 0 NaN trabant NaN 09/03/2016 00:00 0 16775 05/04/2016 16:45
294028 28/03/2016 23:45 0 NaN 1111.0 NaN 0 601 5000 0 NaN trabant NaN 28/03/2016 00:00 0 6712 30/03/2016 16:45
In [9]:
# Kaefer's is the german name for beetle - they are the same car
beetle = df['model'] == 'kaefer'
df.loc[beetle, ['model']] = 'beetle'
In [10]:
del smartnan
del hyundai
del beetle
gc.collect()
Out[10]:
4
In [11]:
df['registrationyear'] = pd.to_numeric(df['registrationyear'], errors = 'coerce')

# Define models and earliest years
model_cols = ['fortwo', 'forfour', 'colt', '3er', '6er', 'passat', 'caddy', 
              'polo', 'golf', 'beetle', 'transporter', 'touareg', 
              'impreza', '601']


earliest_registration = {
    'fortwo': 1998,
    'forfour': 2004,
    'colt': 1962,
    '3er': 1975,
    '6er': 1976,
    'passat': 1973,
    'caddy': 1980,
    'polo': 1975,
    'golf': 1974,
    'beetle': 1938,
    'transporter': 1950,
    'touareg': 2002,
    'impreza': 1992,
    '601': 1964
}


latest_registration = {
    'fortwo': 2025,
    'forfour': 2025,
    'colt': 2013,
    '3er': 2025,
    '6er': 2025,
    'passat': 2025,
    'caddy': 2025,
    'polo': 2025,
    'golf': 2025,
    'beetle': 2019,
    'transporter': 2025,
    'touareg': 2025,
    'impreza': 2025,
    '601': 1991
}



df['registration_correction'] = np.nan

for model in model_cols:
    # Too early
    too_early = (df['model'] == model) & (df['registrationyear'] < earliest_registration[model])
    df.loc[too_early, ['registration_correction']] = "Y: too early"
    # Too late
    too_late = (df['model'] == model) & (df['registrationyear'] > latest_registration[model])
    df.loc[too_late, ['registration_correction']] = "Y: too late"
    # Missing
    missing = df['registrationyear'].isna()
    df.loc[missing,['registration_correction']] = "Y: missing"
    # acceptable registration
    acceptable = (df['model'] == model) & ((df['registrationyear'] >= earliest_registration[model]) & (df['registrationyear'] <= latest_registration[model]))
    df.loc[acceptable,['registration_correction']] = 'N'
else:
    np.nan
In [12]:
display(df['registration_correction'].isna().sum())



registration_years = {
    'corsa': {'earliest': 1982, 'latest': 2025},
    'astra': {'earliest': 1991, 'latest': 2025},
    'passat': {'earliest': 1973, 'latest': 2025},
    'a4': {'earliest': 1994, 'latest': 2025},
    'c_klasse': {'earliest': 1993, 'latest': 2025},
    '5er': {'earliest': 1972, 'latest': 2025},
    'e_klasse': {'earliest': 1993, 'latest': 2025},
    'a3': {'earliest': 1996, 'latest': 2025},
    'focus': {'earliest': 1998, 'latest': 2025},
    'fiesta': {'earliest': 1976, 'latest': 2025},
    'a6': {'earliest': 1994, 'latest': 2025},
    'twingo': {'earliest': 1993, 'latest': 2025},
    'transporter': {'earliest': 1950, 'latest': 2025},
    '2_reihe': {'earliest': 1982, 'latest': 2025},
    'vectra': {'earliest': 1988, 'latest': 2008},
    'a_klasse': {'earliest': 1997, 'latest': 2025},
    'mondeo': {'earliest': 1993, 'latest': 2025},
    'clio': {'earliest': 1991, 'latest': 2025},
    '1er': {'earliest': 2004, 'latest': 2025},
    '3_reihe': {'earliest': 1982, 'latest': 2025},
    'touran': {'earliest': 2003, 'latest': 2025},
    'punto': {'earliest': 1993, 'latest': 2025},
    'zafira': {'earliest': 1999, 'latest': 2025},
    'megane': {'earliest': 1995, 'latest': 2025},
    'ibiza': {'earliest': 1984, 'latest': 2025},
    'ka': {'earliest': 1996, 'latest': 2025},
    'lupo': {'earliest': 1998, 'latest': 2005},
    'octavia': {'earliest': 1996, 'latest': 2025},
    'fabia': {'earliest': 1999, 'latest': 2025},
    'cooper': {'earliest': 2001, 'latest': 2025},
    'clk': {'earliest': 1997, 'latest': 2010},
    'micra': {'earliest': 1982, 'latest': 2025},
    '80': {'earliest': 1972, 'latest': 1996},
    'caddy': {'earliest': 1980, 'latest': 2025},
    'x_reihe': {'earliest': 2000, 'latest': 2025},
    'sharan': {'earliest': 1995, 'latest': 2025},
    'scenic': {'earliest': 1996, 'latest': 2025},
    'omega': {'earliest': 1986, 'latest': 2003},
    'laguna': {'earliest': 1994, 'latest': 2025},
    'civic': {'earliest': 1972, 'latest': 2025},
    '1_reihe': {'earliest': 1970, 'latest': 2025},
    'leon': {'earliest': 1999, 'latest': 2025},
    '6_reihe': {'earliest': 2003, 'latest': 2025},
    'i_reihe': {'earliest': 2004, 'latest': 2025},
    'slk': {'earliest': 1996, 'latest': 2025},
    'galaxy': {'earliest': 1959, 'latest': 2025},
    'tt': {'earliest': 1998, 'latest': 2025},
    'meriva': {'earliest': 2003, 'latest': 2025},
    'yaris': {'earliest': 1999, 'latest': 2025},
    '7er': {'earliest': 1977, 'latest': 2025},
    'mx_reihe': {'earliest': 1989, 'latest': 2025},
    'kangoo': {'earliest': 1997, 'latest': 2025},
    'm_klasse': {'earliest': 1997, 'latest': 2025},
    '500': {'earliest': 1957, 'latest': 2025},
    'escort': {'earliest': 1968, 'latest': 2000},
    'arosa': {'earliest': 1997, 'latest': 2005},
    'one': {'earliest': 2001, 'latest': 2025},
    's_klasse': {'earliest': 1972, 'latest': 2025},
    'vito': {'earliest': 1996, 'latest': 2025},
    'b_klasse': {'earliest': 2005, 'latest': 2025},
    'bora': {'earliest': 1998, 'latest': 2005},
    'berlingo': {'earliest': 1996, 'latest': 2025},
    'tigra': {'earliest': 1994, 'latest': 2008},
    'v40': {'earliest': 1995, 'latest': 2025},
    'sprinter': {'earliest': 1995, 'latest': 2025},
    'transit': {'earliest': 1965, 'latest': 2025},
    'fox': {'earliest': 2003, 'latest': 2025},
    'z_reihe': {'earliest': 1998, 'latest': 2025},
    'swift': {'earliest': 1983, 'latest': 2025},
    'c_max': {'earliest': 2003, 'latest': 2025},
    'corolla': {'earliest': 1966, 'latest': 2025},
    'panda': {'earliest': 1980, 'latest': 2025},
    'seicento': {'earliest': 1998, 'latest': 2007},
    'tiguan': {'earliest': 2007, 'latest': 2025},
    'insignia': {'earliest': 2008, 'latest': 2025},
    '4_reihe': {'earliest': 1892, 'latest': 2025},
    'v70': {'earliest': 1997, 'latest': 2025},
    '156': {'earliest': 1997, 'latest': 2005},
    'primera': {'earliest': 1990, 'latest': 2007},
    'espace': {'earliest': 1984, 'latest': 2025},
    'scirocco': {'earliest': 1974, 'latest': 2017},
    'stilo': {'earliest': 2001, 'latest': 2008},
    'a1': {'earliest': 2010, 'latest': 2025},
    'almera': {'earliest': 1995, 'latest': 2006},
    '147': {'earliest': 2000, 'latest': 2010},
    'avensis': {'earliest': 1997, 'latest': 2025},
    'grand': {'earliest': 1924, 'latest': 2025},
    'a5': {'earliest': 2007, 'latest': 2025},
    'qashqai': {'earliest': 2006, 'latest': 2025},
    'a8': {'earliest': 1994, 'latest': 2025},
    'eos': {'earliest': 2006, 'latest': 2025},
    'c3': {'earliest': 2002, 'latest': 2025}
}



registration_cols = list(registration_years.keys())

for registration in registration_cols:
    earliest = registration_years[registration]['earliest']
    latest = registration_years[registration]['latest']
    # Too early
    early_reg = (df['model'] == registration) & (df['registrationyear'] < earliest) 
    df.loc[early_reg,['registration_correction']] = "Y: too early"
    # Too late
    late_reg = (df['model'] == registration) & (df['registrationyear'] > latest)
    df.loc[late_reg,['registration_correction']] = "Y: too late"
    # Acceptable range
    acc_reg = (df['model'] == registration) & (df['registrationyear'] >= earliest) & (df['registrationyear'] <= latest)
    df.loc[acc_reg,['registration_correction']] = 'N'
    

df['registration_correction'].isna().sum()

    
267415
Out[12]:
73063
In [13]:
df['registration_correction'].value_counts(dropna = False)
Out[13]:
N               278017
NaN              73063
Y: too early      1609
Y: too late       1583
Y: missing          97
Name: registration_correction, dtype: int64
In [14]:
remainder_models = df[(df['registration_correction'].isna()) & (df['model'].notna())]
remainder_models['model'].unique()
Out[14]:
array(['other', 'navara', 'c4', 'kadett', 'signum', 'jetta', 'forester',
       'xc_reihe', 'combo', 'jazz', '100', 'sportage', 'sorento',
       'mustang', 'getz', 'r19', 'cordoba', 'up', 'ceed', '5_reihe',
       'yeti', 'mii', 'rx_reihe', 'modus', 'matiz', 'c1', 'rio', 'logan',
       'spider', 'cuore', 's_max', 'a2', 'viano', 'roomster', 'sl',
       'santa', 'ptcruiser', 'exeo', '159', 'juke', 'carisma', 'accord',
       'lanos', 'phaeton', 'verso', 'rav', 'picanto', 'boxster', 'kalos',
       'superb', 'alhambra', 'roadster', 'ypsilon', 'cayenne', 'galant',
       'justy', '90', 'sirion', 'crossfire', 'agila', 'duster',
       'cr_reihe', 'v50', 'c_reihe', 'v_klasse', 'c5', 'aygo', 'cc',
       'carnival', 'fusion', '911', 'm_reihe', 'cl', '300c', 'spark',
       'kuga', 'x_type', 'ducato', 's_type', 'x_trail', 'toledo', 'altea',
       'voyager', 'calibra', 'bravo', 'antara', 'tucson', 'citigo',
       'jimny', 'wrangler', 'lybra', 'q7', 'lancer', 'captiva', 'c2',
       'discovery', 'freelander', 'sandero', 'note', '900', 'cherokee',
       'clubman', 'samara', 'defender', 'cx_reihe', 'legacy', 'pajero',
       'auris', 'niva', 's60', 'nubira', 'vivaro', 'g_klasse', 'lodgy',
       '850', 'range_rover', 'q3', 'serie_2', 'glk', 'charade', 'croma',
       'outlander', 'doblo', 'musa', 'move', '9000', 'v60', '145', 'aveo',
       '200', 'b_max', 'range_rover_sport', 'terios', 'rangerover', 'q5',
       'range_rover_evoque', 'materia', 'delta', 'gl', 'kalina', 'amarok',
       'elefantino', 'i3', 'kappa', 'serie_3', 'serie_1'], dtype=object)
In [15]:
reg_cols = ['navara', 'c4', 'kadett', 'signum', 'jetta', 'forester',
       'xc_reihe', 'combo', 'jazz', '100', 'sportage', 'sorento',
       'mustang', 'getz', 'r19', 'cordoba', 'up', 'ceed', '5_reihe',
       'yeti', 'mii', 'rx_reihe', 'modus', 'matiz', 'c1', 'rio', 'logan',
       'spider', 'cuore', 's_max', 'a2', 'viano', 'roomster', 'sl',
       'santa', 'ptcruiser', 'exeo', '159', 'juke', 'carisma', 'accord',
       'lanos', 'phaeton', 'verso', 'rav', 'picanto', 'boxster', 'kalos',
       'superb', 'alhambra']

reg_years = {
    'navara':   {'earliest': 1997, 'latest': 2025},
    'c4':       {'earliest': 2004, 'latest': 2025},
    'kadett':   {'earliest': 1937, 'latest': 1991},
    'signum':   {'earliest': 2003, 'latest': 2008},
    'jetta':    {'earliest': 1979, 'latest': 2018},
    'forester': {'earliest': 1997, 'latest': 2025},
    'xc_reihe': {'earliest': 2001, 'latest': 2025},
    'combo':    {'earliest': 1993, 'latest': 2025},
    'jazz':     {'earliest': 2001, 'latest': 2025},
    '100':      {'earliest': 1968, 'latest': 1994},
    'sportage': {'earliest': 1993, 'latest': 2025},
    'sorento':  {'earliest': 2002, 'latest': 2025},
    'mustang':  {'earliest': 1964, 'latest': 2025},
    'getz':     {'earliest': 2002, 'latest': 2011},
    'r19':      {'earliest': 1988, 'latest': 1996},
    'cordoba':  {'earliest': 1993, 'latest': 2009},
    'up':       {'earliest': 2011, 'latest': 2025},
    'ceed':     {'earliest': 2006, 'latest': 2025},
    '5_reihe':  {'earliest': 1972, 'latest': 2025},
    'yeti':     {'earliest': 2009, 'latest': 2017},
    'mii':      {'earliest': 2011, 'latest': 2025},     
    'rx_reihe': {'earliest': 1978, 'latest': 2012},     
    'modus':    {'earliest': 2004, 'latest': 2012},     
    'matiz':    {'earliest': 1998, 'latest': 2018},     
    'c1':       {'earliest': 2005, 'latest': 2025},     
    'rio':      {'earliest': 2000, 'latest': 2025},     
    'logan':    {'earliest': 2004, 'latest': 2025},     
    'spider':   {'earliest': 1996, 'latest': 2006},     
    'cuore':    {'earliest': 1977, 'latest': 2009},     
    's_max':    {'earliest': 2006, 'latest': 2015},
    'a2':       {'earliest': 1999, 'latest': 2005},
    'viano':    {'earliest': 2003, 'latest': 2014},
    'roomster': {'earliest': 2006, 'latest': 2015},
    'sl':       {'earliest': 1952, 'latest': 2011},
    'santa':    {'earliest': 1999, 'latest': 2013},
    'ptcruiser':{'earliest': 2000, 'latest': 2010},
    'exeo':     {'earliest': 2008, 'latest': 2013},
    '159':      {'earliest': 2005, 'latest': 2011},
    'juke':     {'earliest': 2010, 'latest': 2025},
    'carisma':  {'earliest': 1995, 'latest': 2006},
    'accord':   {'earliest': 1976, 'latest': 2025},
    'lanos':    {'earliest': 1997, 'latest': 2009},
    'phaeton':  {'earliest': 2002, 'latest': 2016},
    'verso':    {'earliest': 2001, 'latest': 2018},
    'rav':      {'earliest': 1994, 'latest': 2018},
    'picanto':  {'earliest': 2003, 'latest': 2025},
    'boxster':  {'earliest': 1996, 'latest': 2025},
    'kalos':    {'earliest': 2002, 'latest': 2011},
    'superb':   {'earliest': 2001, 'latest': 2025},
    'alhambra': {'earliest': 1996, 'latest': 2010},
}

 

for reg in reg_cols:
    earliest = reg_years[reg]['earliest']
    latest = reg_years[reg]['latest']
    # Early
    early = (df['model'] == reg) & (df['registrationyear'] < earliest)
    df.loc[early,['registration_correction']] = "Y: too early"
    # Late
    late = (df['model'] == reg) & (df['registrationyear'] > latest)
    df.loc[late,['registration_correction']] = "Y: too late"
    # Acceptable Range
    ar = (df['model'] == reg) & (df['registrationyear'] >= earliest) & (df['registrationyear'] <= latest)
    df.loc[ar,['registration_correction']] = "N"


df['registration_correction'].isna().sum()
Out[15]:
58277
In [16]:
remainder_models = df[(df['registration_correction'].isna()) & (df['model'].notna())]
remainder_models['model'].unique()
Out[16]:
array(['other', 'roadster', 'ypsilon', 'cayenne', 'galant', 'justy', '90',
       'sirion', 'crossfire', 'agila', 'duster', 'cr_reihe', 'v50',
       'c_reihe', 'v_klasse', 'c5', 'aygo', 'cc', 'carnival', 'fusion',
       '911', 'm_reihe', 'cl', '300c', 'spark', 'kuga', 'x_type',
       'ducato', 's_type', 'x_trail', 'toledo', 'altea', 'voyager',
       'calibra', 'bravo', 'antara', 'tucson', 'citigo', 'jimny',
       'wrangler', 'lybra', 'q7', 'lancer', 'captiva', 'c2', 'discovery',
       'freelander', 'sandero', 'note', '900', 'cherokee', 'clubman',
       'samara', 'defender', 'cx_reihe', 'legacy', 'pajero', 'auris',
       'niva', 's60', 'nubira', 'vivaro', 'g_klasse', 'lodgy', '850',
       'range_rover', 'q3', 'serie_2', 'glk', 'charade', 'croma',
       'outlander', 'doblo', 'musa', 'move', '9000', 'v60', '145', 'aveo',
       '200', 'b_max', 'range_rover_sport', 'terios', 'rangerover', 'q5',
       'range_rover_evoque', 'materia', 'delta', 'gl', 'kalina', 'amarok',
       'elefantino', 'i3', 'kappa', 'serie_3', 'serie_1'], dtype=object)
In [17]:
r_cols = [
        'roadster', 'ypsilon', 'cayenne', 'galant',
       'justy', '90', 'sirion', 'crossfire', 'agila', 'duster',
       'cr_reihe', 'v50', 'c_reihe', 'v_klasse', 'c5', 'aygo', 'cc',
       'carnival', 'fusion', '911', 'm_reihe', 'cl', '300c', 'spark', 'kuga', 'x_type',
       'ducato', 's_type', 'x_trail', 'toledo', 'altea', 'voyager',
       'calibra', 'bravo', 'antara', 'tucson', 'citigo', 'jimny',
       'wrangler', 'lybra', 'q7', 'lancer', 'captiva', 'c2', 'discovery',
       'freelander', 'sandero', 'note', '900', 'cherokee', 'clubman',
       'samara', 'defender', 'cx_reihe', 'legacy', 'pajero', 'auris',
       'niva', 's60', 'nubira', 'vivaro', 'g_klasse', 'lodgy', '850',
       'range_rover', 'q3', 'serie_2', 'glk', 'charade', 'croma',
       'outlander', 'doblo', 'musa', 'move', '9000', 'v60', '145', 'aveo',
       '200', 'b_max', 'range_rover_sport', 'terios', 'rangerover', 'q5',
       'range_rover_evoque', 'materia', 'delta', 'gl', 'kalina', 'amarok',
       'elefantino', 'i3', 'kappa', 'serie_3', 'serie_1'
]


r_years = {
    'roadster':    {'earliest': 1998, 'latest': 2025},
    'ypsilon':     {'earliest': 1995, 'latest': 2025},
    'cayenne':     {'earliest': 2002, 'latest': 2025},
    'galant':      {'earliest': 1969, 'latest': 2012},
    'justy':       {'earliest': 1984, 'latest': 2010},
    '90':          {'earliest': 1984, 'latest': 1987},
    'sirion':      {'earliest': 1995, 'latest': 2025},
    'crossfire':   {'earliest': 2003, 'latest': 2008},
    'agila':       {'earliest': 2000, 'latest': 2014},
    'duster':      {'earliest': 2010, 'latest': 2025},
    'cr_reihe':    {'earliest': 1995, 'latest': 2025},
    'v50':         {'earliest': 2004, 'latest': 2012},
    'c_reihe':     {'earliest': 1993, 'latest': 2025},
    'v_klasse':    {'earliest': 1996, 'latest': 2025},
    'c5':          {'earliest': 2001, 'latest': 2017},
    'aygo':        {'earliest': 2005, 'latest': 2025},
    'cc':          {'earliest': 2008, 'latest': 2017},
    'carnival':    {'earliest': 1998, 'latest': 2025},
    'fusion':      {'earliest': 2002, 'latest': 2020},
    '911':         {'earliest': 1963, 'latest': 2025},
    'm_reihe':     {'earliest': 1976, 'latest': 2025},
    'cl':          {'earliest': 1996, 'latest': 2014},
    '300c':        {'earliest': 2005, 'latest': 2020},
    'spark':       {'earliest': 1998, 'latest': 2025},
    'kuga':        {'earliest': 2008, 'latest': 2025},
    'x_type':      {'earliest': 2001, 'latest': 2009},
    'ducato':      {'earliest': 1981, 'latest': 2025},
    's_type':      {'earliest': 1998, 'latest': 2008},
    'x_trail':     {'earliest': 2000, 'latest': 2025},
    'toledo':      {'earliest': 1991, 'latest': 2013},
    'altea':       {'earliest': 2004, 'latest': 2015},
    'voyager':     {'earliest': 1984, 'latest': 2025},
    'calibra':     {'earliest': 1989, 'latest': 1997},
    'bravo':       {'earliest': 1995, 'latest': 2006},
    'antara':      {'earliest': 2006, 'latest': 2025},
    'tucson':      {'earliest': 2004, 'latest': 2025},
    'citigo':      {'earliest': 2011, 'latest': 2025},
    'jimny':       {'earliest': 1983, 'latest': 2025},
    'wrangler':    {'earliest': 1986, 'latest': 2025},
    'lybra':       {'earliest': 1998, 'latest': 2005},
    'q7':          {'earliest': 2005, 'latest': 2025},
    'lancer':      {'earliest': 1973, 'latest': 2017},
    'captiva':     {'earliest': 2006, 'latest': 2025},
    'c2':          {'earliest': 2003, 'latest': 2009},
    'discovery':   {'earliest': 1989, 'latest': 2025},
    'freelander':  {'earliest': 1997, 'latest': 2014},
    'sandero':     {'earliest': 2007, 'latest': 2025},
    'note':        {'earliest': 2004, 'latest': 2025},
    '900':         {'earliest': 1978, 'latest': 1993},
    'cherokee':    {'earliest': 1984, 'latest': 2025},
    'clubman':     {'earliest': 2007, 'latest': 2025},
    'samara':      {'earliest': 1984, 'latest': 2001},
    'defender':    {'earliest': 1983, 'latest': 2016},
    'cx_reihe':    {'earliest': 2006, 'latest': 2011},
    'legacy':      {'earliest': 1989, 'latest': 2025},
    'pajero':      {'earliest': 1982, 'latest': 2021},
    'auris':       {'earliest': 2006, 'latest': 2025},
    'niva':        {'earliest': 1977, 'latest': 2025},
    's60':         {'earliest': 2000, 'latest': 2025},
    'nubira':      {'earliest': 1997, 'latest': 2008},
    'vivaro':      {'earliest': 2001, 'latest': 2025},
    'g_klasse':    {'earliest': 1979, 'latest': 2025},
    'lodgy':       {'earliest': 2012, 'latest': 2025},
    '850':         {'earliest': 1991, 'latest': 1997},
    'range_rover': {'earliest': 1970, 'latest': 2025},
    'q3':          {'earliest': 2011, 'latest': 2025},
    'serie_2':     {'earliest': 1958, 'latest': 2025},
    'glk':         {'earliest': 2008, 'latest': 2015},
    'charade':     {'earliest': 1977, 'latest': 2000},
    'croma':       {'earliest': 1985, 'latest': 2010},
    'outlander':   {'earliest': 2001, 'latest': 2025},
    'doblo':       {'earliest': 2000, 'latest': 2025},
    'musa':        {'earliest': 2004, 'latest': 2012},
    'move':        {'earliest': 1998, 'latest': 2002},
    '9000':        {'earliest': 1985, 'latest': 1998},
    'v60':         {'earliest': 2010, 'latest': 2025},
    '145':         {'earliest': 1994, 'latest': 2000},
    'aveo':        {'earliest': 2002, 'latest': 2011},
    '200':         {'earliest': 1980, 'latest': 2007},
    'b_max':       {'earliest': 2007, 'latest': 2012},
    'range_rover_sport': {'earliest': 2005, 'latest': 2025},
    'terios':      {'earliest': 1997, 'latest': 2017},
    'rangerover':  {'earliest': 1970, 'latest': 2025},
    'q5':          {'earliest': 2008, 'latest': 2025},
    'range_rover_evoque':{'earliest': 2011, 'latest': 2025},
    'materia':     {'earliest': 2007, 'latest': 2012},
    'delta':       {'earliest': 1979, 'latest': 2014},
    'gl':          {'earliest': 2006, 'latest': 2015},
    'kalina':      {'earliest': 2004, 'latest': 2018},
    'amarok':      {'earliest': 2010, 'latest': 2025},
    'elefantino':  {'earliest': 1963, 'latest': 2011},
    'i3':          {'earliest': 2013, 'latest': 2025},
    'kappa':       {'earliest': 1994, 'latest': 2001},
    'serie_3':     {'earliest': 1975, 'latest': 2025},
    'serie_1':     {'earliest': 2004, 'latest': 2025}
}

for r in r_cols:
    earliest = r_years[r]['earliest']
    latest = r_years[r]['latest']
    # early
    e = (df['model'] == r) & (df['registrationyear'] < earliest)
    df.loc[e,['registration_correction']] = "Y: too early"
    #late
    l = (df['model'] == r) & (df['registrationyear'] > latest)
    df.loc[l,['registration_correction']] = "Y: too late"
    # acceptable
    a = (df['model'] == r) & (df['registrationyear'] >= earliest) & (df['registrationyear'] <= latest)
    df.loc[a,['registration_correction']] = "N"

df['registration_correction'].isna().sum()
Out[17]:
44028
In [18]:
remainder_models = df[(df['registration_correction'].isna()) & (df['model'].notna())]
remainder_models['model'].value_counts()
Out[18]:
other    24421
Name: model, dtype: int64
In [19]:
# Mark incorrect registration years as Nan
other_reg_less = (df['model'] == 'other') & (df['registrationyear'] < 1893)
other_reg_more = (df['model'] == 'other') & (df['registrationyear'] > 2025)
df.loc[other_reg_more,['registration_correction']] = "Y: too late (other)"
df.loc[other_reg_less,['registration_correction']] = "Y: too early (other)"
df.loc[other_reg_more,['registrationyear']] = np.nan
df.loc[other_reg_less,['registrationyear']] = np.nan
In [20]:
# Mark remaining incorrect registration years as Nan
incorrect = ((df['registrationyear'] < 1893) | (df['registrationyear'] > 2025))
df.loc[incorrect,['registrationyear']] = np.nan
In [21]:
del incorrect
In [22]:
df[(df['registration_correction'].isna()) & (df['model'] == 'other')].value_counts(subset = 'brand')

brand_reg_cols = [
    'mercedes_benz', 'citroen', 'fiat', 'ford', 'hyundai', 'peugeot', 'opel', 
    'suzuki', 'audi', 'mazda', 'renault', 'chevrolet', 'toyota', 'mitsubishi', 
    'volkswagen', 'nissan', 'volvo', 'alfa_romeo', 'kia', 'rover', 'chrysler', 
    'saab', 'honda', 'skoda', 'bmw', 'jaguar', 'porsche', 'jeep', 'seat', 
    'daihatsu', 'lancia', 'mini', 'daewoo', 'trabant', 'smart', 'subaru', 
    'lada', 'dacia', 'land_rover'
]

brand_registration_years = {
    'mercedes_benz': {'earliest': 1926, 'latest': 2025},
    'citroen': {'earliest': 1919, 'latest': 2025},
    'fiat': {'earliest': 1899, 'latest': 2025},
    'ford': {'earliest': 1903, 'latest': 2025},
    'hyundai': {'earliest': 1967, 'latest': 2025},
    'peugeot': {'earliest': 1889, 'latest': 2025},
    'opel': {'earliest': 1899, 'latest': 2025},
    'suzuki': {'earliest': 1955, 'latest': 2025},
    'audi': {'earliest': 1910, 'latest': 2025},
    'mazda': {'earliest': 1931, 'latest': 2025},
    'renault': {'earliest': 1898, 'latest': 2025},
    'chevrolet': {'earliest': 1911, 'latest': 2025},
    'toyota': {'earliest': 1936, 'latest': 2025},
    'mitsubishi': {'earliest': 1917, 'latest': 2025},
    'volkswagen': {'earliest': 1937, 'latest': 2025},
    'nissan': {'earliest': 1933, 'latest': 2025},
    'volvo': {'earliest': 1927, 'latest': 2025},
    'alfa_romeo': {'earliest': 1910, 'latest': 2025},
    'kia': {'earliest': 1944, 'latest': 2025},
    'rover': {'earliest': 1904, 'latest': 2005},
    'chrysler': {'earliest': 1925, 'latest': 2025},
    'saab': {'earliest': 1947, 'latest': 2011},
    'honda': {'earliest': 1963, 'latest': 2025},
    'skoda': {'earliest': 1905, 'latest': 2025},
    'bmw': {'earliest': 1928, 'latest': 2025},
    'jaguar': {'earliest': 1935, 'latest': 2025},
    'porsche': {'earliest': 1948, 'latest': 2025},
    'jeep': {'earliest': 1941, 'latest': 2025},
    'seat': {'earliest': 1950, 'latest': 2025},
    'daihatsu': {'earliest': 1951, 'latest': 2025},
    'lancia': {'earliest': 1908, 'latest': 2025},
    'mini': {'earliest': 1959, 'latest': 2025},
    'daewoo': {'earliest': 1937, 'latest': 2011},
    'trabant': {'earliest': 1957, 'latest': 1991},
    'smart': {'earliest': 1998, 'latest': 2025},
    'subaru': {'earliest': 1954, 'latest': 2025},
    'lada': {'earliest': 1966, 'latest': 2025},
    'dacia': {'earliest': 1966, 'latest': 2025},
    'land_rover': {'earliest': 1948, 'latest': 2025}
}

for brand in brand_reg_cols:
    other = (df['model'] == 'other')
    reg_corr_na = (df['registration_correction'].isna())
    earliest = brand_registration_years[brand]['earliest']
    latest = brand_registration_years[brand]['latest']
    # too early
    te = (df['brand'] == brand) & other & reg_corr_na & (df['registrationyear'] < earliest)
    df.loc[te,['registration_correction']] = "Y: too early (other)"
    # too late
    tl = (df['brand'] == brand) & other & reg_corr_na & (df['registrationyear'] > latest)
    df.loc[tl,['registration_correction']] = "Y: too late (other)"
    # acceptable
    accept = (df['brand'] == brand) & other & reg_corr_na & (df['registrationyear'] >= earliest) \
    & (df['registrationyear'] <= latest)
    df.loc[accept,['registration_correction']] = "N"

df['registration_correction'].isna().sum()
Out[22]:
19607
In [23]:
for brand in brand_reg_cols:
    reg_corr_na = (df['registration_correction'].isna())
    earliest = brand_registration_years[brand]['earliest']
    latest = brand_registration_years[brand]['latest']
    # too early
    ear = (df['brand'] == brand) & reg_corr_na & (df['registrationyear'] < earliest)
    df.loc[ear,['registration_correction']] = "Y: too early"
    # too late
    lat = (df['brand'] == brand) & reg_corr_na & (df['registrationyear'] > latest)
    df.loc[lat,['registration_correction']] = "Y: too late"
    # acceptable
    apt = (df['brand'] == brand) & reg_corr_na & (df['registrationyear'] >= earliest) \
    & (df['registrationyear'] <= latest)
    df.loc[apt,['registration_correction']] = "N"

df['registration_correction'].isna().sum()
Out[23]:
3338
In [24]:
df[(df['registration_correction'].isna()) & ((df['registrationyear'] < 1893) | (df['registrationyear'] > 2025))]

# Mark the remaining Nan values in registration_correction as "N"
df.loc[(df['registration_correction'].isna()), ['registration_correction']] = "N"
In [25]:
df['registration_correction'].value_counts()

other_l = df['registration_correction'] == 'Y: too late (other)'
other_e = df['registration_correction'] == 'Y: too early (other)'

df.loc[other_l,['registration_correction']] = "Y: too late"
df.loc[other_e,['registration_correction']] = "Y: too early"

df['registration_correction'].value_counts()
Out[25]:
N               349367
Y: too late       2950
Y: too early      1955
Y: missing          97
Name: registration_correction, dtype: int64
In [26]:
del other_l
del other_e
In [27]:
display(df[(df['registration_correction'] == 'Y: missing') & (df['brand'] != 'sonstige_autos')])

#pd.merge(df[(df['registration_correction'] == 'Y: missing') & (df['brand'] != 'sonstige_autos')],on = 'index', how = 'left')

index = car_dates[car_dates['registrationyear'].notna()].index
display(index)
values = car_dates.loc[index, 'registrationyear'].values
display(values)
print("")
print("")
print("")
dictionary = dict(zip(index,values))



df.loc[index, ['registrationyear']] = values

display(df.loc[index])
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
622 16/03/2016 16:55 0 NaN NaN NaN 0 NaN 5000 0 NaN opel NaN 16/03/2016 00:00 0 44628 20/03/2016 16:44 Y: missing
18023 24/03/2016 08:57 1 NaN NaN NaN 0 NaN 5000 0 NaN volkswagen NaN 24/03/2016 00:00 0 50829 06/04/2016 22:45 Y: missing
24458 29/03/2016 19:50 50 small NaN manual 0 NaN 5000 1 NaN volkswagen yes 29/03/2016 00:00 0 91487 06/04/2016 05:46 Y: missing
32768 11/03/2016 17:53 1500 small NaN manual 75 NaN 100000 4 petrol smart NaN 11/03/2016 00:00 0 47055 05/04/2016 18:45 Y: missing
34332 01/04/2016 06:02 450 NaN NaN NaN 1800 NaN 5000 2 NaN mitsubishi no 01/04/2016 00:00 0 63322 01/04/2016 09:42 Y: missing
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
323443 26/03/2016 20:58 30 NaN NaN NaN 0 NaN 5000 0 NaN audi NaN 26/03/2016 00:00 0 37574 06/04/2016 12:17 Y: missing
325739 30/03/2016 11:36 400 NaN NaN NaN 0 NaN 5000 0 NaN mercedes_benz NaN 30/03/2016 00:00 0 8060 01/04/2016 06:16 Y: missing
333004 20/03/2016 14:57 0 suv NaN manual 0 NaN 5000 0 NaN toyota NaN 20/03/2016 00:00 0 48683 20/03/2016 14:57 Y: missing
333488 23/03/2016 01:36 0 NaN NaN NaN 0 NaN 10000 0 NaN bmw NaN 23/03/2016 00:00 0 32689 23/03/2016 08:47 Y: missing
343083 01/04/2016 08:51 1 other NaN NaN 0 NaN 5000 1 other volkswagen NaN 01/04/2016 00:00 0 18273 07/04/2016 05:44 Y: missing

61 rows × 17 columns

Int64Index([   622,  15147,  15870,  17346,  20159,  34332,  38875,  41170,
             46935,  55605,  60017,  60079,  66198,  70847,  78128,  84841,
             87522,  91869,  94926, 110123, 112768, 118047, 122692, 128677,
            129221, 129980, 130474, 135865, 139360, 139756, 146323, 146507,
            148570, 149151, 151228, 151725, 158283, 167937, 172242, 174531,
            177353, 183779, 184598, 200525, 202258, 206219, 214830, 215678,
            220638, 221736, 224832, 226526, 230741, 233631, 234896, 242233,
            243656, 244092, 244254, 248137, 252476, 255866, 260401, 268091,
            272024, 278390, 278517, 290609, 295172, 316487, 323443, 325739,
            333488, 340548, 348830],
           dtype='int64')
array([1111, 9999, 3200, 8888, 4100, 1800, 1234, 5300, 6000, 1000, 1000,
       9999, 1000, 1255, 1000, 3800, 4800, 1000, 7000, 1000, 1000, 6000,
       2500, 9999, 1000, 1000, 9450, 1000, 1800, 2500, 1234, 5000, 1688,
       9999, 9999, 1000, 6000, 9999, 2800, 1253, 9999, 1000, 9999, 9999,
       9000, 5600, 1600, 1111, 2222, 1039, 9999, 3000, 1000, 1000, 9996,
       1000, 1000, 1000, 3000, 6000, 2222, 2800, 8455, 9999, 5000, 4500,
       1500, 1500, 9229, 5000, 1000, 1000, 9999, 3500, 1000])


datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
622 16/03/2016 16:55 0 NaN 1111.0 NaN 0 NaN 5000 0 NaN opel NaN 16/03/2016 00:00 0 44628 20/03/2016 16:44 Y: missing
15147 14/03/2016 00:52 0 NaN 9999.0 NaN 0 NaN 10000 0 NaN sonstige_autos NaN 13/03/2016 00:00 0 32689 21/03/2016 23:46 Y: missing
15870 02/04/2016 11:55 1700 NaN 3200.0 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 02/04/2016 00:00 0 33649 06/04/2016 09:46 Y: missing
17346 06/03/2016 16:06 6500 NaN 8888.0 NaN 0 NaN 10000 0 NaN sonstige_autos NaN 06/03/2016 00:00 0 55262 30/03/2016 20:46 Y: missing
20159 01/04/2016 21:57 1600 NaN 4100.0 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 01/04/2016 00:00 0 67686 05/04/2016 20:19 Y: missing
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
323443 26/03/2016 20:58 30 NaN 1000.0 NaN 0 NaN 5000 0 NaN audi NaN 26/03/2016 00:00 0 37574 06/04/2016 12:17 Y: missing
325739 30/03/2016 11:36 400 NaN 1000.0 NaN 0 NaN 5000 0 NaN mercedes_benz NaN 30/03/2016 00:00 0 8060 01/04/2016 06:16 Y: missing
333488 23/03/2016 01:36 0 NaN 9999.0 NaN 0 NaN 10000 0 NaN bmw NaN 23/03/2016 00:00 0 32689 23/03/2016 08:47 Y: missing
340548 02/04/2016 17:44 0 NaN 3500.0 manual 75 NaN 5000 3 petrol sonstige_autos NaN 02/04/2016 00:00 0 96465 04/04/2016 15:17 Y: missing
348830 22/03/2016 00:38 1 NaN 1000.0 NaN 1000 NaN 150000 0 NaN sonstige_autos NaN 21/03/2016 00:00 0 41472 05/04/2016 14:18 Y: missing

75 rows × 17 columns

In [28]:
ind97 = df.loc[[32768, 261138, 320335]]
ind95 = df.loc[[148233]]
ind96 = df.loc[[176475]]

df.loc[ind97.index,['registrationyear']] = 1997
df.loc[ind95.index,['registrationyear']] = 1995
df.loc[ind96.index,['registrationyear']] = 1996

df.loc[[32768, 261138, 320335, 148233, 176475]]
Out[28]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
32768 11/03/2016 17:53 1500 small 1997.0 manual 75 NaN 100000 4 petrol smart NaN 11/03/2016 00:00 0 47055 05/04/2016 18:45 Y: missing
261138 28/03/2016 11:56 500 wagon 1997.0 manual 90 NaN 150000 12 gasoline smart yes 28/03/2016 00:00 0 99310 06/04/2016 15:15 Y: missing
320335 15/03/2016 19:58 850 wagon 1997.0 auto 170 NaN 150000 0 NaN smart no 15/03/2016 00:00 0 4205 16/03/2016 17:51 Y: missing
148233 02/04/2016 21:57 1000 small 1995.0 manual 60 NaN 150000 0 petrol smart no 02/04/2016 00:00 0 6667 06/04/2016 22:46 Y: missing
176475 07/03/2016 09:52 1000 NaN 1996.0 auto 0 NaN 150000 0 NaN smart NaN 07/03/2016 00:00 0 3222 08/03/2016 03:46 Y: missing
In [29]:
del ind97
del ind95
del ind96
In [30]:
display(df['registrationyear'].isna().sum())

index_rc = df[(df['registration_correction'] == "Y: missing") & (df['registrationyear'].isna())].index

display(index_rc)

index_replace = df1.loc[index_rc]
display(index_replace)

df.loc[index_rc,['registrationyear']] = 1910
113
Int64Index([ 18023,  24458,  64345,  69320,  90011, 150021, 154571, 155833,
            166750, 188748, 190238, 212091, 225151, 273431, 321782, 333004,
            343083],
           dtype='int64')
DateCrawled Price VehicleType RegistrationYear Gearbox Power Model Mileage RegistrationMonth FuelType Brand NotRepaired DateCreated NumberOfPictures PostalCode LastSeen
18023 24/03/2016 08:57 1 NaN 1910 NaN 0 NaN 5000 0 NaN volkswagen NaN 24/03/2016 00:00 0 50829 06/04/2016 22:45
24458 29/03/2016 19:50 50 small 1910 manual 0 NaN 5000 1 NaN volkswagen yes 29/03/2016 00:00 0 91487 06/04/2016 05:46
64345 11/03/2016 09:37 160 NaN 1910 NaN 0 NaN 5000 0 NaN hyundai NaN 11/03/2016 00:00 0 52525 24/03/2016 10:15
69320 11/03/2016 22:53 20 NaN 1910 NaN 0 NaN 5000 0 NaN trabant NaN 11/03/2016 00:00 0 6618 25/03/2016 16:16
90011 03/04/2016 09:02 5000 NaN 1910 NaN 0 NaN 150000 0 petrol bmw NaN 03/04/2016 00:00 0 21079 07/04/2016 10:45
150021 11/03/2016 22:56 20 NaN 1910 NaN 0 NaN 5000 0 NaN trabant NaN 11/03/2016 00:00 0 6618 26/03/2016 06:46
154571 24/03/2016 09:57 0 NaN 1910 NaN 0 NaN 5000 0 NaN jeep NaN 24/03/2016 00:00 0 24622 27/03/2016 05:46
155833 11/03/2016 22:37 15 NaN 1910 NaN 0 NaN 5000 0 NaN trabant NaN 11/03/2016 00:00 0 90491 25/03/2016 11:18
166750 17/03/2016 19:40 99 NaN 1910 NaN 0 NaN 150000 0 NaN subaru yes 17/03/2016 00:00 0 21635 17/03/2016 19:40
188748 24/03/2016 13:46 0 NaN 1910 NaN 0 NaN 5000 0 NaN bmw NaN 24/03/2016 00:00 0 1279 07/04/2016 05:16
190238 11/03/2016 23:49 15 NaN 1910 NaN 0 NaN 5000 0 NaN trabant NaN 11/03/2016 00:00 0 6618 26/03/2016 06:46
212091 02/04/2016 21:48 200 NaN 1910 NaN 0 NaN 5000 0 NaN trabant NaN 02/04/2016 00:00 0 2627 06/04/2016 22:44
225151 09/03/2016 17:48 0 NaN 1910 NaN 0 NaN 150000 0 NaN trabant NaN 09/03/2016 00:00 0 26676 09/03/2016 17:48
273431 09/03/2016 13:50 2500 NaN 1910 NaN 0 NaN 5000 0 NaN volkswagen NaN 09/03/2016 00:00 0 59320 15/03/2016 14:46
321782 25/03/2016 18:50 0 small 1910 manual 600 NaN 150000 5 NaN volkswagen yes 25/03/2016 00:00 0 35764 25/03/2016 21:27
333004 20/03/2016 14:57 0 suv 1910 manual 0 NaN 5000 0 NaN toyota NaN 20/03/2016 00:00 0 48683 20/03/2016 14:57
343083 01/04/2016 08:51 1 other 1910 NaN 0 NaN 5000 1 other volkswagen NaN 01/04/2016 00:00 0 18273 07/04/2016 05:44
In [31]:
del index_rc
gc.collect()
Out[31]:
0
In [32]:
display(df[df['registration_correction'] == "Y: missing"])
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
622 16/03/2016 16:55 0 NaN 1111.0 NaN 0 NaN 5000 0 NaN opel NaN 16/03/2016 00:00 0 44628 20/03/2016 16:44 Y: missing
15147 14/03/2016 00:52 0 NaN 9999.0 NaN 0 NaN 10000 0 NaN sonstige_autos NaN 13/03/2016 00:00 0 32689 21/03/2016 23:46 Y: missing
15870 02/04/2016 11:55 1700 NaN 3200.0 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 02/04/2016 00:00 0 33649 06/04/2016 09:46 Y: missing
17346 06/03/2016 16:06 6500 NaN 8888.0 NaN 0 NaN 10000 0 NaN sonstige_autos NaN 06/03/2016 00:00 0 55262 30/03/2016 20:46 Y: missing
18023 24/03/2016 08:57 1 NaN 1910.0 NaN 0 NaN 5000 0 NaN volkswagen NaN 24/03/2016 00:00 0 50829 06/04/2016 22:45 Y: missing
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
333004 20/03/2016 14:57 0 suv 1910.0 manual 0 NaN 5000 0 NaN toyota NaN 20/03/2016 00:00 0 48683 20/03/2016 14:57 Y: missing
333488 23/03/2016 01:36 0 NaN 9999.0 NaN 0 NaN 10000 0 NaN bmw NaN 23/03/2016 00:00 0 32689 23/03/2016 08:47 Y: missing
340548 02/04/2016 17:44 0 NaN 3500.0 manual 75 NaN 5000 3 petrol sonstige_autos NaN 02/04/2016 00:00 0 96465 04/04/2016 15:17 Y: missing
343083 01/04/2016 08:51 1 other 1910.0 NaN 0 NaN 5000 1 other volkswagen NaN 01/04/2016 00:00 0 18273 07/04/2016 05:44 Y: missing
348830 22/03/2016 00:38 1 NaN 1000.0 NaN 1000 NaN 150000 0 NaN sonstige_autos NaN 21/03/2016 00:00 0 41472 05/04/2016 14:18 Y: missing

97 rows × 17 columns

In [33]:
for brand in brand_reg_cols:
    y_miss = (df['registration_correction'] == 'Y: missing')
    earliest = brand_registration_years[brand]['earliest']
    latest = brand_registration_years[brand]['latest']
    # too early
    earl = (df['brand'] == brand) & y_miss & (df['registrationyear'] < earliest)
    df.loc[earl,['registration_correction']] = "Y: too early"
    # too late
    late = (df['brand'] == brand) & y_miss & (df['registrationyear'] > latest)
    df.loc[late,['registration_correction']] = "Y: too late"
    # acceptable
    ap = (df['brand'] == brand) & y_miss & (df['registrationyear'] >= earliest) \
    & (df['registrationyear'] <= latest)
    df.loc[ap,['registration_correction']] = "N"
In [34]:
y_less = (df['registration_correction'] == "Y: missing") & (df['registrationyear'] < 1893)  

y_more = (df['registration_correction'] == "Y: missing") & (df['registrationyear'] > 2025)

acceptable = (df['registration_correction'] == "Y: missing") & (df['registrationyear'] > 1893) \
& (df['registrationyear'] < 2025)

df.loc[y_less, ['registration_correction']] = "Y: too early"
df.loc[y_more, ['registration_correction']] = "Y: too late"
df.loc[acceptable,['registration_correction']] = "N"
In [35]:
del y_less
del y_more
del acceptable
gc.collect()
Out[35]:
0
In [36]:
display(df['registration_correction'].value_counts())
N               349367
Y: too late       2992
Y: too early      2010
Name: registration_correction, dtype: int64
In [37]:
index = df[df['registrationyear'].isna()].index
values = df1['RegistrationYear'].loc[index]
display(values)

df.loc[index,['registrationyear']] = values
df.loc[index]
12946     5000
16062     1000
17271     9999
18224     1500
18259     2200
          ... 
335727    7500
338829    3000
340759    1600
341791    3000
351682    1800
Name: RegistrationYear, Length: 96, dtype: int64
Out[37]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
12946 29/03/2016 18:39 49 NaN 5000.0 NaN 0 golf 5000 12 NaN volkswagen NaN 29/03/2016 00:00 0 74523 06/04/2016 04:16 Y: too late
16062 29/03/2016 23:42 190 NaN 1000.0 NaN 0 mondeo 5000 0 NaN ford NaN 29/03/2016 00:00 0 47166 06/04/2016 10:44 Y: too early
17271 23/03/2016 16:43 700 NaN 9999.0 NaN 0 other 10000 0 NaN opel NaN 23/03/2016 00:00 0 21769 05/04/2016 20:16 Y: too late
18224 09/03/2016 17:49 7999 NaN 1500.0 manual 224 impreza 5000 3 NaN subaru NaN 09/03/2016 00:00 0 53577 15/03/2016 05:15 Y: too early
18259 16/03/2016 20:37 300 NaN 2200.0 NaN 0 twingo 5000 12 NaN renault NaN 16/03/2016 00:00 0 45307 07/04/2016 06:45 Y: too late
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
335727 09/03/2016 07:01 0 NaN 7500.0 manual 0 other 10000 0 petrol mini no 09/03/2016 00:00 0 9669 19/03/2016 19:44 Y: too late
338829 24/03/2016 19:49 50 NaN 3000.0 NaN 3000 golf 100000 6 NaN volkswagen yes 24/03/2016 00:00 0 23992 03/04/2016 13:17 Y: too late
340759 04/04/2016 23:55 700 NaN 1600.0 manual 1600 a3 150000 4 petrol audi no 04/04/2016 00:00 0 86343 05/04/2016 06:44 Y: too early
341791 28/03/2016 17:37 1 NaN 3000.0 NaN 0 zafira 5000 0 NaN opel NaN 28/03/2016 00:00 0 26624 02/04/2016 22:17 Y: too late
351682 12/03/2016 00:57 11500 NaN 1800.0 NaN 16 other 5000 6 petrol fiat NaN 11/03/2016 00:00 0 16515 05/04/2016 19:47 Y: too early

96 rows × 17 columns

Duplicate Handling¶

In [38]:
display(df.duplicated().sum())
df[df.duplicated()]
df = df.drop_duplicates()
df.duplicated().sum()
262
Out[38]:
0

Missing Values¶

In [39]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 354107 entries, 0 to 354368
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   datecrawled              354107 non-null  object 
 1   price                    354107 non-null  int64  
 2   vehicletype              316623 non-null  object 
 3   registrationyear         354107 non-null  float64
 4   gearbox                  334277 non-null  object 
 5   power                    354107 non-null  int64  
 6   model                    334407 non-null  object 
 7   mileage                  354107 non-null  int64  
 8   registrationmonth        354107 non-null  int64  
 9   fueltype                 321218 non-null  object 
 10  brand                    354107 non-null  object 
 11  notrepaired              282962 non-null  object 
 12  datecreated              354107 non-null  object 
 13  numberofpictures         354107 non-null  int64  
 14  postalcode               354107 non-null  int64  
 15  lastseen                 354107 non-null  object 
 16  registration_correction  354107 non-null  object 
dtypes: float64(1), int64(6), object(10)
memory usage: 48.6+ MB

Missing Values:

Column Percent Missing
Vehicle Type: 10.586 %
GearBox: 5.600 %
Model: 5.564 %
FuelType: 9.288 %
NotReparied: 20.091 %
In [40]:
# Percent Missing
print("Percent Missing")
print("===============")
vt = 354107 - 316623
vtp = (vt/354107) * 100
print(f"Vehicle Type: \n{vtp:.3f} %")
print("")
gb = 354107 - 334277
gbp = (gb/354107) * 100
print(f"GearBox: \n{gbp:.3f} %")
print("")
m = 354107 - 334406
mp = (m/354107) * 100
print(f"Model: \n{mp:.3f} %")
print("")
ft = 354107 - 321218
ftp = (ft/354107) * 100
print(f"FuelType: \n{ftp:.3f} %")
print("")
nr = 354107 - 282962
nrp = (nr/354107) * 100
print(f"NotReparied: \n{nrp:.3f} %")
Percent Missing
===============
Vehicle Type: 
10.586 %

GearBox: 
5.600 %

Model: 
5.564 %

FuelType: 
9.288 %

NotReparied: 
20.091 %
In [41]:
# Inpect Model Column
model_col = df[(df['model'].isna()) & (df['brand'] != 'sonstige_autos')]
model_col_p0 = model_col[model_col['power'] == 0]

brand_p0 = model_col_p0['brand'].value_counts()
brand_p0.plot(kind='bar', x='brand', y='power', figsize=(12,6))
plt.title('0hp powered vehicles by brand (model info. missing)')
plt.xlabel('Brand')
plt.ylabel('0hp Power Frequency')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

display(brand_p0)
display(model_col_p0['brand'].unique())

brand_p0_rows_bottom = ['mitsubishi', 'skoda', 'chevrolet', 'kia', 'porsche', 'chrysler', 'volvo', 'rover', 'daihatsu',
           'daewoo', 'subaru', 'mini', 'lada', 'dacia', 'jeep', 'jaguar', 'lancia', 'saab', 'land_rover']
      

brand_p0_rows_mbottom = ['citroen', 'seat', 'hyundai', 'nissan', 'trabant', 'suzuki', 'toyota', 
                         'alfa_romeo', 'honda']

brand_p0_rows_middle = ['ford', 'audi', 'peugeot', 'renault', 'fiat', 'mazda', 'smart']

brand_p0_rows_top = ['bmw', 'opel', 'mercedes_benz']

brand_p0_rows_vw = ['volkswagen']





# Separate by brand and known model
model_col_notna_top_bottom = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
    & (df['brand'].isin(brand_p0_rows_bottom))]

model_col_notna_top_mbottom = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
    & (df['brand'].isin(brand_p0_rows_mbottom))]

model_col_notna_top_middle = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
    & (df['brand'].isin(brand_p0_rows_middle))]

model_col_notna_top = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
    & (df['brand'].isin(brand_p0_rows_top))]

model_col_notna_top_vw = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
    & (df['brand'].isin(brand_p0_rows_vw))]

# Use known model to find NaN values
# Least missing
brand_p0_notna_top_bottom = model_col_notna_top_bottom[['brand','model']].value_counts().sort_index()

# Middle Least missing
brand_p0_notna_top_mbottom = model_col_notna_top_mbottom[['brand','model']].value_counts().sort_index()

# Middle missing
brand_p0_notna_top_middle = model_col_notna_top_middle[['brand','model']].value_counts().sort_index()

# Top Missing
brand_p0_notna_top = model_col_notna_top[['brand','model']].value_counts().sort_index()

# Volkswagen
brand_p0_notna_top_vw = model_col_notna_top_vw[['brand','model']].value_counts()


# Least Missing
with pd.option_context('display.max_rows', None):
    display(brand_p0_notna_top_bottom)
brand_p0_notna_top_bottom.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0h powered vehicles by brand: Least Missing/ Bottom Tier (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()

# Next Least Missing
display(brand_p0_notna_top_mbottom)
brand_p0_notna_top_mbottom.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0h powered vehicles by brand: Least Missing Ext/ Bottom Tier 2 (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()

# Middle Missing
with pd.option_context('display.max_rows', None):
    display(brand_p0_notna_top_middle)
brand_p0_notna_top_middle.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0h powered vehicles by brand: Middle Missing (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()

# Top Missing
display(brand_p0_notna_top)
brand_p0_notna_top.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0h powered vehicles by brand: Top Missing (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()

# Volkswagen
display(brand_p0_notna_top_vw)
brand_p0_notna_top_vw.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0h powered vehicles by brand: Volkswagen (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
No description has been provided for this image
volkswagen       990
bmw              536
opel             512
mercedes_benz    429
ford             341
audi             325
peugeot          272
renault          254
fiat             182
mazda            114
smart            100
citroen           78
seat              73
hyundai           68
nissan            63
trabant           58
suzuki            50
toyota            47
alfa_romeo        42
honda             40
mitsubishi        38
skoda             38
chevrolet         34
kia               27
porsche           25
chrysler          25
volvo             25
rover             23
daihatsu          19
daewoo            16
subaru            12
mini              12
dacia              7
lada               7
jaguar             6
jeep               6
lancia             5
saab               3
land_rover         2
Name: brand, dtype: int64
array(['volkswagen', 'renault', 'mitsubishi', 'bmw', 'peugeot', 'audi',
       'volvo', 'chevrolet', 'trabant', 'opel', 'smart', 'nissan',
       'suzuki', 'mercedes_benz', 'mazda', 'seat', 'fiat', 'citroen',
       'ford', 'skoda', 'kia', 'chrysler', 'daewoo', 'alfa_romeo',
       'rover', 'porsche', 'dacia', 'honda', 'lada', 'subaru', 'hyundai',
       'toyota', 'mini', 'jaguar', 'daihatsu', 'saab', 'land_rover',
       'lancia', 'jeep'], dtype=object)
brand       model            
chevrolet   aveo                   8
            captiva                8
            matiz                 41
            other                129
            spark                  5
chrysler    300c                  17
            crossfire              6
            grand                  8
            other                 45
            ptcruiser             34
            voyager               65
dacia       duster                13
            lodgy                  5
            logan                 33
            other                  1
            sandero               12
daewoo      kalos                 16
            lanos                 18
            matiz                 39
            nubira                 7
            other                 12
daihatsu    charade                6
            cuore                 77
            materia                1
            move                  14
            other                 15
            sirion                13
            terios                 3
jaguar      other                 17
            s_type                10
            x_type                24
jeep        cherokee              29
            grand                 17
            other                 12
            wrangler              12
kia         carnival              52
            ceed                   9
            other                 63
            picanto               29
            rio                   33
            sorento               30
            sportage              17
lada        kalina                 4
            niva                  26
            other                 17
            samara                 8
lancia      delta                  4
            elefantino             1
            kappa                  2
            lybra                 11
            musa                   2
            other                 11
            ypsilon               21
land_rover  defender              13
            discovery              9
            freelander            27
            other                  4
            range_rover            7
            range_rover_sport      2
            serie_1                2
            serie_2                2
            serie_3                1
mini        clubman                6
            cooper                63
            one                   27
            other                 14
mitsubishi  carisma               56
            colt                  83
            galant                32
            lancer                35
            other                 92
            outlander             10
            pajero                21
porsche     911                   28
            boxster               15
            cayenne                8
            other                 37
rover       freelander             1
            other                 57
            rangerover             1
saab        900                   15
            9000                   2
            other                 12
skoda       citigo                 2
            fabia                142
            octavia              141
            other                 51
            roomster              11
            superb                10
            yeti                   1
subaru      forester               9
            impreza               21
            justy                 18
            legacy                14
            other                  4
volvo       850                   25
            c_reihe                6
            other                 55
            s60                    1
            v40                   97
            v50                    8
            v60                    2
            v70                   47
            xc_reihe               7
dtype: int64
No description has been provided for this image
brand       model   
alfa_romeo  145          11
            147          39
            156          58
            159          10
            other        39
            spider       23
citroen     berlingo     94
            c1           32
            c2           37
            c3           48
            c4           27
            c5           41
            other       285
honda       accord       30
            civic       130
            cr_reihe     14
            jazz         19
            other        38
hyundai     getz         60
            i_reihe      56
            other       140
            santa        18
            tucson       11
nissan      almera       73
            juke          4
            micra       301
            navara       14
            note          7
            other        79
            primera      86
            qashqai      26
            x_trail      20
seat        alhambra     27
            altea        15
            arosa       127
            cordoba      60
            ibiza       206
            leon         33
            other        31
            toledo       43
suzuki      grand         9
            jimny        19
            other       124
            swift        69
toyota      auris        15
            avensis      35
            aygo         41
            corolla      82
            other        94
            rav          26
            verso        16
            yaris        90
trabant     601         165
            other        38
dtype: int64
No description has been provided for this image
brand    model   
audi     100          33
         200           2
         80          212
         90           15
         a1           12
         a2           39
         a3          490
         a4          729
         a5           17
         a6          363
         a8           44
         other        46
         q3            1
         q5            2
         q7           17
         tt           32
fiat     500          57
         bravo        38
         croma         7
         doblo        32
         ducato       98
         other       244
         panda        86
         punto       518
         seicento    129
         stilo        71
ford     b_max         1
         c_max        36
         escort      143
         fiesta      726
         focus       537
         fusion       22
         galaxy      148
         ka          536
         kuga         11
         mondeo      433
         mustang      32
         other       203
         s_max         5
         transit     105
mazda    1_reihe      17
         3_reihe     137
         5_reihe      21
         6_reihe     124
         cx_reihe      4
         mx_reihe     62
         other       146
         rx_reihe     12
peugeot  1_reihe     132
         2_reihe     322
         3_reihe     208
         4_reihe      56
         5_reihe       7
         other       184
renault  clio        523
         espace       98
         kangoo      181
         laguna      181
         megane      399
         modus        38
         other       104
         r19          22
         scenic      216
         twingo      955
smart    forfour      30
         fortwo      423
         other        24
         roadster     12
dtype: int64
No description has been provided for this image
brand          model   
bmw            1er          123
               3er         1534
               5er          499
               6er           13
               7er           79
               i3             3
               m_reihe       17
               other         85
               x_reihe      125
               z_reihe       22
mercedes_benz  a_klasse     539
               b_klasse      54
               c_klasse     744
               cl            21
               clk          138
               e_klasse     638
               g_klasse      14
               gl             1
               glk            4
               m_klasse      71
               other        326
               s_klasse     114
               sl            49
               slk           76
               sprinter     132
               v_klasse      24
               viano         31
               vito         127
opel           agila         70
               antara        13
               astra       1111
               calibra       22
               combo         57
               corsa       1770
               insignia      18
               kadett        74
               meriva        61
               omega        163
               other        152
               signum        29
               tigra         72
               vectra       519
               vivaro        28
               zafira       353
dtype: int64
No description has been provided for this image
brand       model      
volkswagen  golf           2460
            polo           1611
            passat          883
            transporter     496
            touran          361
            lupo            332
            sharan          230
            caddy           190
            beetle          166
            other           137
            fox              75
            bora             62
            touareg          58
            jetta            47
            scirocco         27
            tiguan           24
            phaeton          20
            eos               8
            cc                8
            up                4
            amarok            1
dtype: int64
No description has been provided for this image
In [42]:
model_col_notna_top_vw['model'].unique()

vw0 = ['golf', 'polo', 'passat', 'transporter', 'touran', 'lupo', 'sharan',
       'caddy', 'beetle', 'fox', 'bora', 'touareg',  'jetta', 
       'scirocco', 'tiguan', 'phaeton', 'eos', 'cc', 'up', 'amarok']


model_col_notna_top['model'].unique()

bmw0 = ['3er', '5er', 'x_reihe', '1er', '7er', 'z_reihe', 'm_reihe' '6er', 'i3'] 

merc0 = ['c_klasse', 'e_klasse', 'a_klasse', 'clk', 'sprinter', 'vito', 's_klasse', 'slk', 'm_klasse', 
         'b_klasse', 'sl', 'viano', 'v_klasse', 'cl', 'g_klasse', 'glk', 'gl']

opel0 = ['corsa', 'astra', 'vectra', 'zafira', 'omega', 'kadett', 'tigra', 'agila', 'meriva', 'combo', 
         'signum', 'vivaro', 'calibra', 'insignia', 'antara']

model_col_notna_top_middle['model'].unique()

audi0 = ['a4', 'a3', 'a6', '80', 'a8', 'a2','100', 'tt', 'a5', 'q7', '90', 'a1', '200', 'q5', 'q3']

fiat0 = ['punto', 'seicento', 'ducato', 'panda', 'stilo', '500', 'bravo', 'doblo', 'croma']

ford0 = ['fiesta', 'focus', 'ka', 'mondeo', 'galaxy', 'escort', 'transit', 'c_max', 'mustang', 
         'fusion', 'kuga', 's_max', 'b_max']

mazda0 = ['3_reihe', '6_reihe', 'mx_reihe', '5_reihe', '1_reihe', 'rx_reihe', 'cx_reihe']

peu0 = '2_reihe', '3_reihe', '1_reihe', '4_reihe', '5_reihe',

ren0 = ['twingo', 'clio', 'megane', 'scenic',  'kangoo', 'laguna', 'espace', 'modus',  'r19']       

smart0 = ['fortwo', 'forfour', 'roadster']
In [43]:
vw_model = df[(df['brand'] == 'volkswagen') & (df['power'] == 0) & (df['model'].isin(vw0))]

pivot_table_min_vw = pd.pivot_table(vw_model, index = 'model', columns = 'vehicletype', values = 'registrationyear', aggfunc = ('min'))
pivot_table_min_vw.plot(kind = 'bar', figsize = (12,8))
plt.title("Volkswagen Models with 0hp Engines")
plt.ylim(1892, 2026)
plt.grid(True)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [44]:
df['datecreated'] = pd.to_datetime(df['datecreated'])
In [45]:
# Cars shouldn't be registered after datecreated

plt.scatter(df['registrationyear'], df['datecreated'].dt.year, alpha=0.3)
plt.xlabel("Registration Year")
plt.ylabel("Ad Creation Year")
plt.title("Registration Year vs Ad Creation Year")
plt.show()
No description has been provided for this image
In [46]:
# The latest registrationyear should be 2016

mask = (df['registrationyear'] > 2016) & (df['registration_correction'] != "Y: too late")

df.loc[mask,['registration_correction']] = "Y: too late"

df[(df['registration_correction'] == "Y: too late") & (df['brand'] == 'volkswagen')].value_counts(subset = 'model')
Out[46]:
model
golf           1612
polo            613
passat          303
lupo            227
touran          209
transporter     155
caddy            94
sharan           78
beetle           51
bora             30
other            29
fox              27
jetta            16
scirocco         15
touareg          13
tiguan           12
eos              10
up                7
phaeton           6
cc                5
amarok            1
dtype: int64
In [47]:
df[(df['postalcode']) & (df['model'] == 'bora')].value_counts(subset = 'postalcode').head(60)
Out[47]:
postalcode
56727    5
12051    5
22589    4
23845    4
6749     4
9111     4
30179    4
47167    4
13359    4
53773    4
47475    3
26441    3
74523    3
12157    3
32683    3
59269    3
38259    3
38531    3
45663    3
21075    3
15517    3
84307    3
44805    3
32791    3
44145    3
27283    3
94447    3
31275    3
1219     3
49835    3
4639     3
33719    3
31167    2
30851    2
34117    2
76437    2
37359    2
31137    2
21337    2
27751    2
75181    2
66333    2
65779    2
33609    2
1169     2
65451    2
65199    2
40219    2
63743    2
76149    2
67059    2
34431    2
24247    2
27419    2
27793    2
28213    2
28325    2
26835    2
35415    2
56841    2
dtype: int64
In [48]:
del model_col 
del model_col_p0
del brand_p0 
del brand_p0_rows_bottom 
del brand_p0_rows_mbottom                     
del brand_p0_rows_middle 
del brand_p0_rows_top
del brand_p0_rows_vw 
del model_col_notna_top_bottom 
del model_col_notna_top_mbottom 
del model_col_notna_top_middle 
del model_col_notna_top 
del model_col_notna_top_vw 
del brand_p0_notna_top_bottom 
del brand_p0_notna_top_mbottom 
del brand_p0_notna_top_middle
del brand_p0_notna_top 
del brand_p0_notna_top_vw

gc.collect()
Out[48]:
32347
In [49]:
# These postal codes are German; additionally the Bora was replaced (in Germany) by Jetta after 2005
df[(df['model'] == 'bora') & (df['registrationyear'] > 2005)]
bora_to_jetta = (df['model'] == 'bora') & (df['registrationyear'] > 2005)
df.loc[bora_to_jetta,['model']] = 'jetta'
df.loc[bora_to_jetta,['registrationyear']] = 2016
df.loc[bora_to_jetta,['registration_correction']] = 'N'
df.loc[[18669]]
Out[49]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
18669 03/04/2016 10:45 2499 NaN 2016.0 manual 101 jetta 150000 6 NaN volkswagen no 2016-03-04 0 99097 07/04/2016 11:44 N
In [50]:
# The registration date cannot supercede the datecreated year
df[(df['model'] == 'jetta') & (df['registrationyear'] > 2015)]

# All dates are close to 2016, can assume simple error
jetta16 = (df['model'] == 'jetta') & (df['registrationyear'] > 2016)
df.loc[jetta16, ['registrationyear']] = 2016
df.loc[jetta16, ['registration_correction']] = "N"
df[(df['model'] == 'jetta') & (df['registrationyear'] == 2016)]
Out[50]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
12573 19/03/2016 16:45 5000 NaN 2016.0 manual 115 jetta 150000 0 NaN volkswagen no 2016-03-19 0 99310 20/03/2016 18:46 N
13726 14/03/2016 18:47 0 NaN 2016.0 auto 0 jetta 150000 5 gasoline volkswagen no 2016-03-14 0 25554 06/04/2016 03:16 N
18669 03/04/2016 10:45 2499 NaN 2016.0 manual 101 jetta 150000 6 NaN volkswagen no 2016-03-04 0 99097 07/04/2016 11:44 N
19856 07/03/2016 13:55 2150 NaN 2016.0 manual 75 jetta 150000 2 lpg volkswagen NaN 2016-07-03 0 64354 23/03/2016 05:20 N
42319 02/04/2016 13:55 3599 NaN 2016.0 manual 101 jetta 150000 12 NaN volkswagen no 2016-02-04 0 33334 06/04/2016 12:16 N
69206 04/04/2016 21:42 1790 NaN 2016.0 manual 101 jetta 150000 1 petrol volkswagen no 2016-04-04 0 10117 07/04/2016 00:15 N
69702 31/03/2016 22:54 1499 NaN 2016.0 NaN 0 jetta 150000 8 NaN volkswagen NaN 2016-03-31 0 39112 01/04/2016 01:42 N
73765 11/03/2016 14:57 1700 NaN 2016.0 manual 75 jetta 150000 3 NaN volkswagen no 2016-11-03 0 55270 07/04/2016 13:15 N
92577 24/03/2016 21:57 4790 NaN 2016.0 manual 204 jetta 150000 1 NaN volkswagen no 2016-03-24 0 8056 05/04/2016 15:45 N
109125 23/03/2016 14:55 2499 NaN 2016.0 manual 116 jetta 150000 4 petrol volkswagen NaN 2016-03-23 0 38154 01/04/2016 20:17 N
111319 17/03/2016 00:32 1875 NaN 2016.0 auto 90 jetta 150000 0 petrol volkswagen NaN 2016-03-17 0 56340 05/04/2016 22:44 N
115258 29/03/2016 15:46 3300 NaN 2016.0 NaN 150 jetta 150000 11 petrol volkswagen no 2016-03-29 0 96199 06/04/2016 01:17 N
115695 16/03/2016 16:48 3000 NaN 2016.0 auto 105 jetta 150000 8 NaN volkswagen yes 2016-03-16 0 13051 16/03/2016 16:48 N
116347 04/04/2016 12:50 2800 NaN 2016.0 manual 70 jetta 70000 2 NaN volkswagen no 2016-04-04 0 85459 06/04/2016 14:16 N
120275 29/03/2016 19:56 6800 NaN 2016.0 manual 102 jetta 60000 3 petrol volkswagen no 2016-03-29 0 17213 04/04/2016 05:17 N
132768 01/04/2016 09:51 1799 NaN 2016.0 manual 0 jetta 150000 5 petrol volkswagen NaN 2016-01-04 0 56727 01/04/2016 10:44 N
138919 07/03/2016 16:59 4300 NaN 2016.0 manual 75 jetta 150000 8 NaN volkswagen no 2016-07-03 0 12439 17/03/2016 06:45 N
142694 07/03/2016 16:48 0 NaN 2016.0 auto 90 jetta 20000 2 NaN volkswagen NaN 2016-07-03 0 13587 09/03/2016 12:45 N
145467 02/04/2016 20:53 3050 NaN 2016.0 manual 0 jetta 150000 11 petrol volkswagen no 2016-02-04 0 35260 02/04/2016 21:41 N
153881 15/03/2016 21:45 2888 NaN 2016.0 manual 110 jetta 150000 5 gasoline volkswagen no 2016-03-15 0 15806 19/03/2016 18:44 N
162111 16/03/2016 11:49 3999 NaN 2016.0 manual 160 jetta 125000 0 petrol volkswagen no 2016-03-16 0 38458 22/03/2016 15:45 N
170553 26/03/2016 10:55 1200 NaN 2016.0 manual 150 jetta 150000 12 NaN volkswagen no 2016-03-26 0 57250 05/04/2016 22:45 N
184846 31/03/2016 10:50 1850 NaN 2016.0 manual 0 jetta 150000 5 petrol volkswagen NaN 2016-03-31 0 56727 31/03/2016 10:50 N
185565 01/04/2016 18:53 2200 NaN 2016.0 NaN 0 jetta 150000 0 petrol volkswagen NaN 2016-01-04 0 26441 01/04/2016 18:53 N
189535 15/03/2016 22:37 1500 NaN 2016.0 manual 115 jetta 150000 0 petrol volkswagen NaN 2016-03-15 0 9387 16/03/2016 00:41 N
199567 20/03/2016 16:50 0 NaN 2016.0 manual 90 jetta 150000 0 petrol volkswagen NaN 2016-03-20 0 99867 25/03/2016 17:22 N
208403 02/04/2016 08:55 1600 NaN 2016.0 manual 0 jetta 150000 5 petrol volkswagen NaN 2016-02-04 0 56727 02/04/2016 09:46 N
219654 11/03/2016 10:37 5500 NaN 2016.0 manual 150 jetta 150000 7 NaN volkswagen NaN 2016-11-03 0 2763 07/04/2016 01:45 N
223717 27/03/2016 18:56 0 NaN 2016.0 manual 90 jetta 150000 0 petrol volkswagen no 2016-03-27 0 17109 31/03/2016 00:44 N
227749 30/03/2016 14:50 1590 NaN 2016.0 manual 150 jetta 150000 8 petrol volkswagen NaN 2016-03-30 0 87629 03/04/2016 04:44 N
228488 19/03/2016 13:38 2300 NaN 2016.0 manual 130 jetta 150000 2 NaN volkswagen no 2016-03-19 0 2625 02/04/2016 19:15 N
231870 29/03/2016 03:02 2800 NaN 2016.0 auto 69 jetta 150000 4 petrol volkswagen no 2016-03-29 0 38486 05/04/2016 17:44 N
234824 17/03/2016 11:54 2850 NaN 2016.0 auto 105 jetta 150000 8 NaN volkswagen yes 2016-03-17 0 13051 17/03/2016 11:54 N
237026 16/03/2016 21:46 3000 NaN 2016.0 auto 105 jetta 150000 8 NaN volkswagen yes 2016-03-16 0 13051 16/03/2016 21:46 N
239203 19/03/2016 22:45 2350 NaN 2016.0 manual 101 jetta 150000 10 NaN volkswagen no 2016-03-19 0 33100 07/04/2016 12:17 N
241686 30/03/2016 14:36 3500 NaN 2016.0 manual 115 jetta 150000 5 NaN volkswagen no 2016-03-30 0 49356 07/04/2016 06:15 N
248831 09/03/2016 17:52 1700 NaN 2016.0 manual 150 jetta 150000 5 NaN volkswagen no 2016-09-03 0 42109 12/03/2016 18:15 N
259038 08/03/2016 10:50 1950 NaN 2016.0 manual 101 jetta 150000 1 lpg volkswagen no 2016-08-03 0 47167 16/03/2016 20:48 N
261773 02/04/2016 15:55 1899 NaN 2016.0 auto 0 jetta 150000 5 petrol volkswagen no 2016-02-04 0 6193 02/04/2016 15:55 N
265058 15/03/2016 13:54 0 NaN 2016.0 manual 0 jetta 100000 2 petrol volkswagen NaN 2016-03-15 0 94491 31/03/2016 11:18 N
265331 03/04/2016 19:58 599 NaN 2016.0 manual 75 jetta 150000 0 petrol volkswagen no 2016-03-04 0 4668 05/04/2016 20:45 N
278004 17/03/2016 15:48 3100 NaN 2016.0 auto 90 jetta 125000 10 NaN volkswagen NaN 2016-03-17 0 52393 19/03/2016 15:44 N
290135 13/03/2016 19:38 3000 NaN 2016.0 manual 204 jetta 150000 0 petrol volkswagen NaN 2016-03-13 0 36369 17/03/2016 13:47 N
297474 14/03/2016 09:49 4100 NaN 2016.0 auto 105 jetta 150000 5 NaN volkswagen no 2016-03-14 0 45721 14/03/2016 09:49 N
304913 24/03/2016 22:46 4790 NaN 2016.0 manual 204 jetta 150000 1 NaN volkswagen no 2016-03-24 0 8056 05/04/2016 17:44 N
313465 21/03/2016 12:50 1200 NaN 2016.0 manual 115 jetta 150000 0 NaN volkswagen NaN 2016-03-21 0 84508 06/04/2016 07:45 N
315015 02/04/2016 22:57 3990 NaN 2016.0 auto 90 jetta 100000 8 NaN volkswagen no 2016-02-04 0 77656 07/04/2016 03:45 N
317772 10/03/2016 18:38 2500 NaN 2016.0 manual 75 jetta 150000 5 petrol volkswagen NaN 2016-10-03 0 6667 05/04/2016 21:18 N
317964 26/03/2016 08:54 0 NaN 2016.0 manual 90 jetta 150000 0 petrol volkswagen no 2016-03-26 0 17109 31/03/2016 03:46 N
324089 31/03/2016 18:56 3299 NaN 2016.0 manual 90 jetta 150000 2 NaN volkswagen NaN 2016-03-31 0 21481 06/04/2016 13:15 N
327076 26/03/2016 11:54 1600 NaN 2016.0 NaN 90 jetta 125000 3 NaN volkswagen NaN 2016-03-26 0 52393 06/04/2016 00:15 N
337067 26/03/2016 14:52 1450 NaN 2016.0 manual 0 jetta 150000 7 petrol volkswagen NaN 2016-03-26 0 47137 31/03/2016 10:17 N
341498 16/03/2016 16:44 750 NaN 2016.0 manual 75 jetta 150000 0 petrol volkswagen NaN 2016-03-16 0 35274 28/03/2016 15:46 N
350647 27/03/2016 20:49 4999 NaN 2016.0 manual 115 jetta 150000 11 gasoline volkswagen NaN 2016-03-27 0 91486 27/03/2016 20:49 N
In [51]:
# Find the Brand and Model's where the minimum price is not 0
brands_with_price = df.groupby(['brand','model'])['price'].min()
brands_with_price[brands_with_price != 0]
Out[51]:
brand          model             
audi           q5                       65
bmw            i3                      250
chevrolet      aveo                    350
chrysler       crossfire              3333
               grand                   100
dacia          lodgy                  4900
daewoo         kalos                   250
daihatsu       charade                 150
               materia                2800
               terios                  750
fiat           croma                   350
ford           b_max                  5199
kia            picanto                 500
lada           kalina                  500
lancia         elefantino               80
               kappa                    50
               other                     1
land_rover     other                   550
               range_rover_evoque    12500
               range_rover_sport      1750
               serie_2                6300
mercedes_benz  glk                      30
nissan         juke                      1
rover          defender                550
               discovery              2800
               rangerover             1050
seat           exeo                   5900
               mii                    2500
skoda          citigo                 3690
               yeti                   1750
suzuki         jimny                  1200
toyota         auris                     1
volvo          v60                    1000
Name: price, dtype: int64
In [52]:
pivot = pd.pivot_table(df, index = 'model', columns = 'brand', values = 'price')

pivot.boxplot(vert = False, figsize = (12,8))
Out[52]:
<AxesSubplot:>
No description has been provided for this image
In [53]:
chevy = df[df['brand'] == 'chevrolet']
chevy_pivot = pd.pivot_table(chevy, index = 'registrationyear', columns = 'model', values = 'price')
chevy_pivot
chevy_pivot.boxplot(vert = False)
Out[53]:
<AxesSubplot:>
No description has been provided for this image
In [54]:
captiva = (df['vehicletype'] == 'suv') & (df['registrationyear'] > 2005) & (df['brand'] == 'chevrolet')
df.loc[captiva,['model']] = 'captiva'

df[(df['vehicletype'] == 'suv') & (df['registrationyear'] > 2005) & (df['brand'] == 'chevrolet')]
Out[54]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
2670 07/03/2016 23:56 9199 suv 2006.0 manual 150 captiva 125000 10 gasoline chevrolet no 2016-07-03 0 59821 17/03/2016 21:45 N
7816 02/04/2016 14:45 8600 suv 2008.0 auto 150 captiva 5000 9 gasoline chevrolet no 2016-02-04 0 33602 06/04/2016 13:15 N
9501 03/04/2016 13:46 14950 suv 2011.0 auto 184 captiva 90000 9 gasoline chevrolet no 2016-03-04 0 47918 05/04/2016 12:45 N
10054 09/03/2016 10:53 9500 suv 2007.0 auto 150 captiva 150000 9 gasoline chevrolet no 2016-09-03 0 39343 05/04/2016 14:46 N
10933 22/03/2016 23:59 8300 suv 2008.0 manual 136 captiva 30000 5 petrol chevrolet no 2016-03-22 0 71065 30/03/2016 14:18 N
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
337492 23/03/2016 13:55 18500 suv 2013.0 manual 167 captiva 20000 1 petrol chevrolet no 2016-03-23 0 99734 05/04/2016 15:18 N
341978 03/04/2016 16:50 14999 suv 2012.0 manual 163 captiva 70000 4 gasoline chevrolet no 2016-03-04 0 26209 05/04/2016 16:46 N
342687 31/03/2016 21:58 15900 suv 2012.0 auto 184 captiva 80000 8 gasoline chevrolet no 2016-03-31 0 15831 06/04/2016 18:17 N
344562 10/03/2016 15:49 11990 suv 2011.0 manual 167 captiva 40000 11 petrol chevrolet no 2016-10-03 0 91452 21/03/2016 02:45 N
354111 16/03/2016 16:55 15700 suv 2012.0 auto 184 captiva 100000 3 gasoline chevrolet no 2016-03-16 0 46242 06/04/2016 21:47 N

186 rows × 17 columns

In [55]:
convertible = (df['brand'] == 'chevrolet') & (df['vehicletype'] == 'convertible')
df.loc[convertible,['model']] = 'other'
In [56]:
matiz68 = (df['brand'] == 'chevrolet') & (df['power'] == 68) & (df['price'] < 2600)
df.loc[matiz68,['model']] = 'matiz'
df.loc[matiz68,['vehicletype']] = 'small'

df[(df['brand'] == 'chevrolet') & (df['power'] == 68) & (df['price'] < 2600)]
Out[56]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
82008 08/03/2016 22:44 2599 small 2008.0 manual 68 matiz 100000 8 NaN chevrolet NaN 2016-08-03 0 44145 14/03/2016 06:16 N
140254 22/03/2016 21:36 1200 small 2005.0 manual 68 matiz 90000 5 petrol chevrolet NaN 2016-03-22 0 4155 24/03/2016 07:15 N
205903 14/03/2016 19:41 1799 small 2008.0 manual 68 matiz 100000 5 petrol chevrolet no 2016-03-14 0 24816 06/04/2016 04:17 N
257625 23/03/2016 10:38 1500 small 2005.0 manual 68 matiz 150000 11 lpg chevrolet NaN 2016-03-23 0 41238 24/03/2016 17:17 N
353189 19/03/2016 13:37 1200 small 2016.0 manual 68 matiz 90000 5 petrol chevrolet NaN 2016-03-19 0 4155 21/03/2016 17:50 N
In [57]:
matiz52 = (df['brand'] == 'chevrolet') & (df['power'] == 52)
df.loc[matiz52,['model']] = 'matiz'
df.loc[matiz52,['vehicletype']] = 'small'
df[(df['brand'] == 'chevrolet') & (df['power'] == 52)]
Out[57]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
373 02/04/2016 12:39 1350 small 2005.0 manual 52 matiz 150000 6 petrol chevrolet yes 2016-02-04 0 91207 06/04/2016 10:17 N
2263 27/03/2016 19:55 2399 small 2016.0 manual 52 matiz 80000 7 petrol chevrolet NaN 2016-03-27 0 33605 05/04/2016 18:45 N
2820 26/03/2016 20:47 3350 small 2010.0 manual 52 matiz 80000 2 petrol chevrolet no 2016-03-26 0 18273 06/04/2016 11:17 N
5636 30/03/2016 08:55 3650 small 2009.0 manual 52 matiz 50000 7 petrol chevrolet no 2016-03-30 0 26789 30/03/2016 08:55 N
7123 04/04/2016 18:39 2500 small 2008.0 manual 52 matiz 125000 12 petrol chevrolet no 2016-04-04 0 21493 06/04/2016 20:44 N
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340075 17/03/2016 21:37 4999 small 2010.0 auto 52 matiz 30000 3 petrol chevrolet no 2016-03-17 0 45329 17/03/2016 22:40 N
340549 29/03/2016 15:57 1599 small 2009.0 manual 52 matiz 80000 5 petrol chevrolet no 2016-03-29 0 20357 06/04/2016 02:15 N
344585 13/03/2016 17:50 2100 small 2009.0 manual 52 matiz 125000 11 petrol chevrolet no 2016-03-13 0 22869 28/03/2016 14:16 N
349474 08/03/2016 13:25 2600 small 2009.0 manual 52 matiz 50000 3 petrol chevrolet no 2016-08-03 0 65719 11/03/2016 09:45 N
349800 01/04/2016 22:38 1950 small 2008.0 manual 52 matiz 60000 9 petrol chevrolet no 2016-01-04 0 42369 01/04/2016 23:41 N

101 rows × 17 columns

In [58]:
matiz67 = (df['brand'] == 'chevrolet') & (df['power'] == 67)
df.loc[matiz67,['model']] = 'matiz'
df.loc[matiz67,['vehicletype']] = 'small'

df[(df['brand'] == 'chevrolet') & (df['power'] == 67)]
Out[58]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
1981 27/03/2016 18:43 2990 small 2007.0 manual 67 matiz 125000 4 lpg chevrolet no 2016-03-27 0 72108 05/04/2016 15:15 N
3769 01/04/2016 15:53 1500 small 2016.0 manual 67 matiz 125000 10 NaN chevrolet NaN 2016-01-04 0 4158 07/04/2016 13:50 N
5215 26/03/2016 08:55 2900 small 2010.0 manual 67 matiz 80000 4 petrol chevrolet no 2016-03-26 0 25421 03/04/2016 19:47 N
7757 21/03/2016 09:52 3750 small 2007.0 manual 67 matiz 70000 10 lpg chevrolet no 2016-03-21 0 53945 06/04/2016 02:45 N
9006 14/03/2016 11:38 2750 small 2007.0 manual 67 matiz 70000 10 petrol chevrolet no 2016-03-14 0 21029 07/04/2016 12:45 N
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340326 02/04/2016 22:51 2150 small 2007.0 manual 67 matiz 150000 12 petrol chevrolet no 2016-02-04 0 31863 07/04/2016 00:45 N
344984 26/03/2016 22:54 2100 small 2007.0 manual 67 matiz 125000 6 petrol chevrolet no 2016-03-26 0 48565 04/04/2016 22:47 N
348552 04/04/2016 13:46 2250 small 2006.0 manual 67 matiz 150000 7 lpg chevrolet no 2016-04-04 0 33397 06/04/2016 14:46 N
351693 28/03/2016 17:41 1100 small 2006.0 manual 67 matiz 150000 6 petrol chevrolet no 2016-03-28 0 46537 06/04/2016 23:15 N
352283 12/03/2016 15:46 1950 small 2007.0 manual 67 matiz 90000 8 petrol chevrolet no 2016-12-03 0 48529 15/03/2016 21:16 N

91 rows × 17 columns

In [59]:
peugeot = df[df['brand'] == 'peugeot']
peugeot_pivot = pd.pivot_table(peugeot,index = 'power', columns = 'model', values = 'price')

df[(df['brand'] == 'peugeot') & (df['power'].isin([7,33,42,43,48,57, 454]))]
re_1 = (df['brand'] == 'peugeot') & (df['power'].isin([7,33,42,43,48,57,454]))
df.loc[re_1,['vehicletype']] = 'small'
df.loc[re_1,['model']] = '1_reihe'

df[(df['brand'] == 'peugeot') & (df['power'].isin([7,33,42,43,48,57,454]))]
Out[59]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction
44179 02/04/2016 17:52 500 small 1998.0 auto 7 1_reihe 100000 11 petrol peugeot no 2016-02-04 0 66271 02/04/2016 17:52 N
154470 07/03/2016 10:52 100 small 1995.0 manual 42 1_reihe 150000 6 petrol peugeot NaN 2016-07-03 0 1665 15/03/2016 22:16 N
174795 10/03/2016 23:44 150 small 1997.0 manual 33 1_reihe 150000 11 petrol peugeot yes 2016-10-03 0 66333 11/03/2016 12:17 N
186556 20/03/2016 16:55 430 small 2016.0 NaN 33 1_reihe 150000 9 petrol peugeot NaN 2016-03-20 0 73525 04/04/2016 20:44 N
191097 23/03/2016 22:51 0 small 1997.0 manual 33 1_reihe 125000 6 NaN peugeot yes 2016-03-23 0 86343 06/04/2016 06:45 N
204925 29/03/2016 15:45 850 small 1997.0 manual 57 1_reihe 150000 2 petrol peugeot no 2016-03-29 0 16909 06/04/2016 01:16 N
210942 30/03/2016 15:51 700 small 1998.0 manual 454 1_reihe 150000 8 petrol peugeot NaN 2016-03-30 0 85598 30/03/2016 15:51 N
262687 05/03/2016 16:52 0 small 1996.0 manual 48 1_reihe 150000 7 petrol peugeot yes 2016-05-03 0 26441 24/03/2016 18:45 N
314981 20/03/2016 04:02 700 small 2017.0 manual 33 1_reihe 150000 7 petrol peugeot no 2016-03-20 0 28759 23/03/2016 22:17 Y: too late
323988 10/03/2016 22:50 1033 small 1996.0 manual 43 1_reihe 150000 10 petrol peugeot no 2016-10-03 0 42277 24/03/2016 20:18 N
In [60]:
coupe = df[(df['vehicletype'] == 'coupe') & (df['price'] > 0)]
suv = df[(df['vehicletype'] == 'suv') & (df['price'] > 0)]
small = df[(df['vehicletype'] == 'small') & (df['price'] > 0)]
sedan = df[(df['vehicletype'] == 'sedan') & (df['price'] > 0)]
convertible = df[(df['vehicletype'] == 'convertible') & (df['price'] > 0)]
bus = df[(df['vehicletype'] == 'bus') & (df['price'] > 0)]
wagon = df[(df['vehicletype'] == 'wagon') & (df['price'] > 0)]
In [61]:
wagon['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Wagons per Brand')
Out[61]:
<AxesSubplot:title={'center':'Number of Wagons per Brand'}>
No description has been provided for this image
In [62]:
wagon.groupby('brand')['price'].mean().sort_values(ascending=False).plot(kind='bar', figsize=(10,5), title='Average Wagon Price per Brand')
Out[62]:
<AxesSubplot:title={'center':'Average Wagon Price per Brand'}, xlabel='brand'>
No description has been provided for this image
In [63]:
plt.figure(figsize=(14,16))
sns.boxplot(data=wagon, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Wagon Models per Brand')
plt.grid()
plt.show()
No description has been provided for this image

Wagon Type Vehicles Against Price

Brand Vehicle Type (~)Count Avg Price Distribution (25 - 75)
volkswagen Wagon 12,500 5,000 1,250 - 7,000
audi Wagon 11,000 7,000 2,500 - 11,000
bmw Wagon 8,000 7,000 2,300 - 9,500
opel Wagon 7,000 3,500 1,000 - 4,500
mercedes_benz Wagon 6,500 6,000 1,500 - 8,500
ford Wagon 5,900 6,000 1,500 - 8,000
skoda Wagon 3,000 6,500 2,000 - 9,000
volvo Wagon 2,200 5,500 2,000 - 7,500
renault Wagon 2,000 3,000 1,000 - 4,000
peugeot Wagon 1,800 4,900 1,500 - 6,500
mazda Wagon 1,000 4,800 2,000 - 6,500
toyota Wagon 800 4,700 2,000 - 6,500
alfa_romeo Wagon 600 4,400 1,500 - 6,000
fiat Wagon 500 2,200 1,000 - 3,000
seat Wagon 500 4,000 1,500 - 5,500
nissan Wagon 400 3,800 1,500 - 5,000
citroen Wagon 400 3,700 1,500 - 5,000
mitsubishi Wagon 300 1,800 800 - 2,500
dacia Wagon 300 3,700 2,000 - 5,000
chevrolet Wagon 200 3,500 1,500 - 5,000
hyundai Wagon 200 11,500 6,000 - 15,000
kia Wagon 200 3,300 1,500 - 4,500
mini Wagon 100 8,000 4,000 - 11,000
subaru Wagon <100 4,000 2,000 - 5,500
honda Wagon <100 3,000 1,500 - 4,000
chrysler Wagon <100 2,800 1,000 - 4,000
saab Wagon <100 2,800 1,000 - 4,000
suzuki Wagon <100 2,300 1,000 - 3,000
smart Wagon <100 2,200 1,000 - 3,000
lancia Wagon <100 2,000 800 - 3,000
daewoo Wagon <100 900 500 - 1,200
jaguar Wagon <100 1,800 1,000 - 2,500
land_rover Wagon <100 2,900 1,500 - 4,000
lada Wagon <100 1,700 800 - 2,500
rover Wagon <100 1,600 800 - 2,200
trabant Wagon <100 1,800 1,000 - 2,500
In [64]:
df[(df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & \
    (df['price'] < 7000) & (df['registrationyear'] > 1996) & (df['registrationyear'] < 1999) & (df['power'].isin([150]))]

passat = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'] > 1996) & \
(df['registrationyear'] < 1999) & (df['power'].isin([150]))
df.loc[passat,['model']] = 'passat'


passat1 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'] == 1991) & \
    (df['power'].isin([90,136]))
df.loc[passat1,['model']] = 'passat'

passat2 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'] == 1992) & \
    (df['model'].isna())
df.loc[passat2,['model']] = 'passat'

passat3 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & \
(df['registrationyear'].isin([1982,1993,1994])) & (df['model'].isna())
df.loc[passat3,['model']] = 'passat'

passat4 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'].isin([1996])) & \
(df['power'].isin([174])) & (df['model'].isna())
df.loc[passat4,['model']] = 'passat'

passat5 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['model'].isna()) & (df['power'].isin([125])) & (df['price'] > 1250) & \
(df['price'] < 7000)
df.loc[passat5,['model']] = 'passat'

passat6 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['power'].isin([110,193])) & (df['price'] > 1250) & (df['price'] < 7000) & \
(df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[passat6, ['model']] = 'passat'

golf = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'].isin([1996])) & \
(df['power'].isin([75,110])) & (df['model'].isna())

df.loc[golf,['model']] = 'golf'
In [65]:
passat140 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['model'].isna()) & (df['registrationyear'] > 2004) & \
(df['registrationyear'] < 2007) & (df['power'].isin([140]))
df.loc[passat140,['model']] = 'passat'

golf90 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2000) & \
(df['registrationyear'] < 2005) & (df['power'].isin([90]))
df.loc[golf90,['model']] = 'golf'

passat90 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1985) & \
(df['registrationyear'] < 1993) & (df['power'].isin([90]))
df.loc[passat90,['model']] = 'passat'

golf75 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1993) & \
(df['registrationyear'] < 1995) & (df['power'].isin([75]))
df.loc[golf75,['model']] = 'golf'

golf7502 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2001) & \
(df['registrationyear'] < 2003) & (df['power'].isin([75]))
df.loc[golf7502,['model']] = 'golf'

passat105 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1995) & \
(df['registrationyear'] < 1998) & (df['power'].isin([105]))
df.loc[passat105,['model']] = 'passat'

passat131 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1999) & \
(df['registrationyear'] < 2002) & (df['power'].isin([131]))
df.loc[passat131,['model']] = 'passat'

passat116 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1989) \
& (df['registrationyear'] < 1997) & (df['power'].isin([116]))
df.loc[passat116,['model']] = 'passat'

passat150 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1995) \
& (df['registrationyear'] < 2006) & (df['power'].isin([150]))
df.loc[passat150,['model']] = 'passat'

passat115 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1990) \
& (df['registrationyear'] < 1997) & (df['power'].isin([115]))
df.loc[passat115,['model']] = 'passat'

passat170 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2004) & \
(df['registrationyear'] < 2012) & (df['power'].isin([170]))
df.loc[passat170,['model']] = 'passat'

golf110 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2013) & \
(df['registrationyear'] < 2017) & (df['power'].isin([60]))
df.loc[golf110,['model']] = 'golf'

golf60 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1990) & \
(df['registrationyear'] < 1996) & (df['power'].isin([60]))
df.loc[golf60,['model']] = 'golf'

polo60 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1996) & \
(df['registrationyear'] < 2001) & (df['power'].isin([60]))
df.loc[polo60,['model']] = 'polo'

passat125 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1996) & \
(df['registrationyear'] < 2000) & (df['power'].isin([125]))
df.loc[passat125,['model']] = 'passat'

passat100 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1990) & \
(df['registrationyear'] < 2005) & (df['power'].isin([100]))
df.loc[passat100,['model']] = 'passat'

passat174 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1993) & \
(df['registrationyear'] < 1997) & (df['power'].isin([174]))
df.loc[passat174,['model']] = 'passat'

passat130 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1998) & \
(df['registrationyear'] < 2005) & (df['power'].isin([130]))
df.loc[passat130,['model']] = 'passat'

passat120 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1980) & (df['registrationyear'] < 2000) & (df['power'].isin([120]))
df.loc[passat120,['model']] = 'passat'
In [66]:
vw_small75 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1985,1992]))
df.loc[vw_small75,['model']] = 'golf'

vw_sedan75 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'] > 1993) & (df['registrationyear'] < 2007)
df.loc[vw_sedan75,['model']] = 'golf'

opel_sedan84 = (df['brand'] == 'opel') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1984]))
df.loc[opel_sedan84,['model']] = 'kadett'

opel_sedan94 = (df['brand'] == 'opel') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1994,1999,2000]))
df.loc[opel_sedan94,['model']] = 'astra'

opel_sedan04 = (df['brand'] == 'opel') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([2004,2008]))
df.loc[opel_sedan04,['model']] = 'corsa'

ford_sedan99 = (df['brand'] == 'ford') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1999,2001,2003]))
df.loc[ford_sedan99,['model']] = 'focus'

opel_wagon96 = (df['brand'] == 'opel') & (df['vehicletype'] == 'wagon') & (df['model'].isna()) & (df['power'].isin([75])) & (df['registrationyear'] > 1995) \
& (df['registrationyear'] < 2001)
df.loc[opel_wagon96,['model']] = 'astra'

opel_small01 = (df['brand'] == 'opel') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) \
& (df['registrationyear'].isin([2001, 2002, 2003, 2004, 2006, 2008]))
df.loc[opel_small01,['model']] = 'corsa'

renault_small91 = (df['brand'] == 'renault') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'] > 1990) & (df['registrationyear'] < 2001)
df.loc[renault_small91,['model']] = 'clio'

peugeot_small92 = (df['brand'] == 'peugeot') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1992]))
df.loc[peugeot_small92,['model']] = '1_reihe'

peugeot_small94 = (df['brand'] == 'peugeot') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1994]))
df.loc[peugeot_small94,['model']] = '3_reihe'

peugeot_small00 = (df['brand'] == 'peugeot') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'] > 1999) & (df['registrationyear'] < 2010)
df.loc[peugeot_small00,['model']] = '2_reihe'
In [67]:
del vw_small75
del vw_sedan75
del opel_sedan84
del opel_sedan94
del opel_sedan04
del ford_sedan99
del opel_wagon96
del opel_small01
del renault_small91
del peugeot_small92
del peugeot_small94
del peugeot_small00
In [68]:
brand_power = df[(df['power'].isin([75,60,150,101,140,90,116,105,170,125,136,102])) & (df['model'].notna()) & (df['model'] != 'model') & \
    (df['brand'] != 'sonstige_autos')].value_counts(subset = 'brand')

brand_power.plot(kind = 'bar')
plt.title("Brands with Top HP counts")
plt.grid()
plt.show()
No description has been provided for this image
In [69]:
df[(df['brand'] == 'nissan')].value_counts(subset = 'model')
Out[69]:
model
micra      1756
other       702
primera     620
almera      584
qashqai     531
x_trail     206
note        130
juke        102
navara       98
dtype: int64
In [70]:
brand_power1 = df[(df['power'].isin([75,60])) & (df['model'].notna()) & (df['model'] != 'model') & \
    (df['brand'] != 'sonstige_autos')]
brand_power2 = df[(df['power'].isin([150,101])) & (df['model'].notna()) & (df['model'] != 'model') & \
    (df['brand'] != 'sonstige_autos')]
brand_power3 = df[(df['power'].isin([140,90])) & (df['model'].notna()) & (df['model'] != 'model') & \
    (df['brand'] != 'sonstige_autos')]
brand_power4 = df[(df['power'].isin([116,105])) & (df['model'].notna()) & (df['model'] != 'model') & \
    (df['brand'] != 'sonstige_autos')]
brand_power5 = df[(df['power'].isin([170,125])) & (df['model'].notna()) & (df['model'] != 'model') & \
    (df['brand'] != 'sonstige_autos')]
brand_power6 = df[(df['power'].isin([136,102])) & (df['model'].notna()) & (df['model'] != 'model') & \
    (df['brand'] != 'sonstige_autos')]

top5_brand_power = ['volkswagen','opel','bmw','audi','ford']
over1000_brand_power = ['mercedes_benz', 'renault', 'peugeot', 'seat', 'skoda', 'fiat', 'citroen', 'honda', 'mazda', 'mini', 'nissan', 'mitsubishi', 'volvo']  
under1000_brand_power = ['toyota', 'alfa_romeo', 'hyundai', 'kia', 'dacia', 'suzuki', 'chrysler', 'subaru', 'smart', 'chevrolet', 'saab', 'lancia', 
                        'rover', 'jeep', 'daihatsu', 'daewoo', 'porsche', 'lada', 'land_rover', 'jaguar'] 


top5_brands = brand_power1[brand_power1['brand'].isin(top5_brand_power)]
top5_brands2 = brand_power2[brand_power2['brand'].isin(top5_brand_power)]
top5_brands3 = brand_power3[brand_power3['brand'].isin(top5_brand_power)]
top5_brands4 = brand_power4[brand_power4['brand'].isin(top5_brand_power)]
top5_brands5 = brand_power5[brand_power5['brand'].isin(top5_brand_power)]
top5_brands6 = brand_power6[brand_power6['brand'].isin(top5_brand_power)]

middle_brands = brand_power1[brand_power1['brand'].isin(over1000_brand_power)]
middle_brands2 = brand_power2[brand_power2['brand'].isin(over1000_brand_power)]
middle_brands3 = brand_power3[brand_power3['brand'].isin(over1000_brand_power)]
middle_brands4 = brand_power4[brand_power4['brand'].isin(over1000_brand_power)]
middle_brands5 = brand_power5[brand_power5['brand'].isin(over1000_brand_power)]
middle_brands6 = brand_power6[brand_power6['brand'].isin(over1000_brand_power)]


lower_brands = brand_power1[brand_power1['brand'].isin(under1000_brand_power)]
lower_brands2 = brand_power2[brand_power2['brand'].isin(under1000_brand_power)]
lower_brands3 = brand_power3[brand_power3['brand'].isin(under1000_brand_power)]
lower_brands4 = brand_power4[brand_power4['brand'].isin(under1000_brand_power)]
lower_brands5 = brand_power5[brand_power5['brand'].isin(under1000_brand_power)]
lower_brands6 = brand_power6[brand_power6['brand'].isin(under1000_brand_power)]


# Use known model and power to find Nan
top5 = top5_brands[['brand','model','power']].value_counts().sort_index()
top52 = top5_brands2[['brand','model','power']].value_counts().sort_index()
top53 = top5_brands3[['brand','model','power']].value_counts().sort_index()
top54 = top5_brands4[['brand','model','power']].value_counts().sort_index()
top55 = top5_brands5[['brand','model','power']].value_counts().sort_index()
top56 = top5_brands6[['brand','model','power']].value_counts().sort_index()

middle = middle_brands[['brand','model','power']].value_counts().sort_index()
middle2 = middle_brands2[['brand','model','power']].value_counts().sort_index()
middle3 = middle_brands3[['brand','model','power']].value_counts().sort_index()
middle4 = middle_brands4[['brand','model','power']].value_counts().sort_index()
middle5 = middle_brands5[['brand','model','power']].value_counts().sort_index()
middle6 = middle_brands6[['brand','model','power']].value_counts().sort_index()


lower = lower_brands[['brand','model','power']].value_counts().sort_index()
lower2 = lower_brands2[['brand','model','power']].value_counts().sort_index()
lower3 = lower_brands3[['brand','model','power']].value_counts().sort_index()
lower4 = lower_brands4[['brand','model','power']].value_counts().sort_index()
lower5 = lower_brands5[['brand','model','power']].value_counts().sort_index()
lower6 = lower_brands6[['brand','model','power']].value_counts().sort_index()


print("Batch 1: HP [60 & 70]")
# Top 5 Prevalent Brands w/ specified HP [60 & 70]
top5.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [60 & 70] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

# Middle Prevalent Brands w/ specified HP [60 & 70]
middle.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [60 & 70] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

# Lower Prevalent Brands w/ specified HP [60 & 70]
lower.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [60 & 70] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()


print("Batch 2: HP [150 & 101]")
# Batch 2: HP [150 & 101]
top52.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [150 & 101] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

middle2.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [150 & 101] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

lower2.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [150 & 101] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()



print("Batch 3: HP [140 & 90]")
# Batch 3: HP [140 & 90]
top53.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [140 & 90] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

middle3.plot(kind = 'bar', x = ('brand','model','power'), figsize = (20,8))
plt.title('Middle: Count of [140 & 90] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

lower3.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [140 & 90] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()



print("Batch 4: HP [116 & 105]")
# Batch 4: HP [116 & 105]
top54.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [116 & 105] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

middle4.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [116 & 105] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

lower4.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [116 & 105] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()



print("Batch 5: HP [170 & 125]")
# Batch 5: HP [170 & 125]
top55.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [170 & 125] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

middle5.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [170 & 125] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

lower5.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [170 & 125] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()



print("Batch 6: HP [136 & 102]")
# Batch 6: HP [136 & 102]
top56.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [136 & 102] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

middle6.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [136 & 102] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()

lower6.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [136 & 102] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
Batch 1: HP [60 & 70]
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Batch 2: HP [150 & 101]
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Batch 3: HP [140 & 90]
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Batch 4: HP [116 & 105]
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Batch 5: HP [170 & 125]
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Batch 6: HP [136 & 102]
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [71]:
audi75 = (df['brand'].isin(['audi'])) & (df['power'].isin([60,75])) & (df['model'].isna())
df.loc[audi75,['model']] = 'audi'

bmw75 = (df['brand'].isin(['bmw'])) & (df['power'].isin([60,75])) & (df['model'].isna())
df.loc[bmw75,['model']] = 'bmw'

opelsedan60 = (df['brand'].isin(['opel'])) & (df['power'].isin([60])) & (df['vehicletype'] == 'sedan') & (df['registrationyear'] < 1991) & (df['model'].isna())
df.loc[opelsedan60,['model']] = 'kadett'

opel9160 = (df['brand'].isin(['opel'])) & (df['power'].isin([60])) & ~(df['vehicletype'].isin(['wagon','small'])) & (df['registrationyear'] > 1990) & (df['registrationyear'] < 1992) & (df['model'].isna())
df.loc[opel9160,['model']] = 'kadett'

opelastra = (df['brand'].isin(['opel'])) & (df['vehicletype'] != 'small') & (df['power'].isin([60])) & (df['registrationyear'] > 1991) & (df['registrationyear'] < 1993)& (df['model'].isna())
df.loc[opelastra,['model']] = 'astra'

astraopel = (df['brand'].isin(['opel'])) & (df['vehicletype'] != 'small') & (df['power'].isin([60])) & (df['registrationyear'] > 1992) & (df['registrationyear'] < 2000) & (df['model'].isna())
df.loc[astraopel,['model']] = 'astra'

opelcorsa = (df['brand'].isin(['opel']))  & (df['vehicletype'] != 'bus') & (df['power'].isin([60])) & (df['model'].isna())
df.loc[opelcorsa,['model']] = 'corsa'

opelcombo = (df['brand'].isin(['opel'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[opelcombo,['model']] = 'combo'

civic75 = (df['brand'].isin(['honda'])) & (df['power'].isin([60, 75])) & (df['model'].isna())
df.loc[civic75,['model']] = 'civic'

mini75 = (df['brand'].isin(['mini'])) & (df['power'].isin([60, 75])) & (df['model'].isna())
df.loc[mini75,['model']] = 'one'

nissan60 = (df['brand'].isin(['nissan'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[nissan60,['model']] = 'micra'

seat60 = (df['brand'].isin(['seat'])) & (df['vehicletype'] != 'sedan') & (df['power'].isin([60])) & (df['model'].isna())
df.loc[seat60,['model']] = 'ibiza'

seatcordoba = (df['brand'].isin(['seat'])) & (df['power'].isin([60])) & (df['registrationyear'] == 1994) & (df['model'].isna())
df.loc[seatcordoba,['model']] = 'cordoba'

ibiza60 = (df['brand'].isin(['seat'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[ibiza60,['model']] = 'ibiza'

cordoba93 = (df['brand'].isin(['seat'])) & (df['registrationyear'].isin([1993])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[cordoba93,['model']] = 'cordoba'

ibiza94 = (df['brand'].isin(['seat'])) & (df['registrationyear'].isin([1994])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[ibiza94,['model']] = 'ibiza'

cordoba97 = (df['brand'].isin(['seat'])) & (df['registrationyear'].isin([1997])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[cordoba97,['model']] = 'cordoba'

ibizasmall = (df['brand'].isin(['seat'])) & (df['vehicletype'] == 'small') & (df['registrationyear'].isin([1999])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[ibizasmall,['model']] = 'ibiza'

cordoba99 = (df['brand'].isin(['seat'])) & (df['registrationyear'].isin([1999])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[cordoba99,['model']] = 'cordoba'

ibiza03 = (df['brand'].isin(['seat'])) & (df['registrationyear'] > 2002) & (df['registrationyear'] < 2012) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[ibiza03,['model']] = 'ibiza'

skoda60 = (df['brand'].isin(['skoda'])) & (df['registrationyear'] > 2000) & (df['registrationyear'] != 2013) & (df['power'].isin([60,75])) & (df['model'].isna())
df.loc[skoda60,['model']] = 'fabia'

lancia60 = (df['brand'].isin(['lancia'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[lancia60,['model']] = 'ypsilon'

smart60 = (df['brand'].isin(['smart']))  & (df['power'].isin([60])) & (df['model'].isna())
df.loc[smart60,['model']] = 'fortwo'

smart75 = (df['brand'].isin(['smart'])) & (df['registrationyear'] > 2003) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[smart75,['model']] = 'forfour'

bmw101 = (df['brand'].isin(['bmw'])) & (df['power'].isin([101])) & (df['model'].isna())
df.loc[bmw101,['model']] = '3er'

ford101 = (df['brand'].isin(['ford'])) & (df['power'].isin([101])) & (df['model'].isna())
df.loc[ford101,['model']] = 'focus'

chevy150 = (df['brand'].isin(['chevrolet'])) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[chevy150,['model']] = 'other'

mit150 = (df['brand'].isin(['mitsubishi'])) & (df['registrationyear'] < 1994) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mit150,['model']] = 'other'

mitgalant = (df['brand'].isin(['mitsubishi'])) & (df['registrationyear'] == 1996) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mitgalant,['model']] = 'galant'

mit99 = (df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'sedan') & (df['registrationyear'] == 1999) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mit99,['model']] = 'galant'

mitbus = (df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'bus') & (df['registrationyear'] == 1999) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mitbus,['model']] = 'other'

mitbus00 = (df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'bus') & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mitbus00,['model']] = 'other'

honda101 = (df['brand'].isin(['honda'])) & (df['power'].isin([101])) & (df['model'].isna())
df.loc[honda101,['model']] = 'civic'

honda150 = (df['brand'].isin(['honda'])) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[honda150,['model']] = 'cr_reihe'  

hondasuv = (df['brand'] == 'honda') & (df['model'].isna()) & (df['vehicletype'] == 'suv')
df.loc[hondasuv,['model']] = 'cr_reihe'
In [72]:
topbrand_vt = ['volkswagen']
vt_power = df[(df['brand'].notna()) & (df['brand'].isin(topbrand_vt)) & (df['model'].notna()) & (df['vehicletype'].notna())]

vwvt = vt_power[['vehicletype','model']].value_counts().sort_index()

vwvt.plot(kind = 'bar', figsize = (16,8))
plt.title("Volkswagen: Model & Vehicle Type Abundance")
plt.grid()
plt.show()
No description has been provided for this image
In [73]:
# VW GOLF

golf = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([211,230,174,102, 122, 350, 250, 170, 86, 200,100,109,190,68,80,72,131,144,129,77,160,76,204])) & (df['model'].isna())
df.loc[golf,['model']] = 'golf'

golf02 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([90])) & (df['registrationyear'] > 2002) & (df['model'].isna())
df.loc[golf02,['model']] = 'golf'

golf98 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([60])) & (df['registrationyear'] < 1998) & (df['model'].isna())
df.loc[golf98,['model']] = 'golf'

golf09 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([125,110])) & (df['registrationyear'] == 2009) & (df['model'].isna())
df.loc[golf09,['model']] = 'golf'

golf99 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([150])) & (df['registrationyear'] > 1999) & (df['model'].isna())
df.loc[golf99,['model']] = 'golf'

golf04 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([140])) & (df['registrationyear'] == 2004) & (df['model'].isna())
df.loc[golf04,['model']] = 'golf'

golf91 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([55])) & (df['registrationyear'] != 1991) & (df['model'].isna())
df.loc[golf91,['model']] = 'golf'



# VW POLO

polo = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([64,54])) & (df['model'].isna())
df.loc[polo,['model']] = 'polo'

polo98 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([60])) & (df['registrationyear'] > 1998) & (df['model'].isna())
df.loc[polo98,['model']] = 'polo'



# VW PASSAT 

passat = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([148,136])) & (df['model'].isna())
df.loc[passat,['model']] = 'passat'

passat97 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([125])) & (df['registrationyear'] == 1997) & (df['model'].isna())
df.loc[passat97,['model']] = 'passat'



### VW BEETLE

beetle = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([30])) & (df['model'].isna())
df.loc[beetle,['model']] = 'beetle'



#### VW JETTA

jetta = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([70])) & (df['registrationyear'] == 1981) & (df['model'].isna())
df.loc[jetta,['model']] = 'jetta'



### VW PHAETON

phaeton = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([313,420,240])) & (df['model'].isna())
df.loc[phaeton,['model']] = 'phaeton'

phaeton05 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['registrationyear'] == 2005) & (df['power'].isin([224])) & (df['model'].isna())
df.loc[phaeton05,['model']] = 'phaeton'
In [74]:
trabant = (df['vehicletype'] == 'wagon') & (df['brand'] == 'trabant') & (df['model'].isna())
df.loc[trabant,['model']] = '601'

bmw = (df['vehicletype'] == 'wagon') & (df['registrationyear'] < 1990) & (df['brand'] == 'bmw') & (df['model'].isna())
df.loc[bmw,['model']] = '3er'

vw80 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] < 1990) & (df['brand'] == 'volkswagen') & (df['model'].isna())
df.loc[vw80,['model']] = 'passat'

opel82 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] == 1982) & (df['brand'] == 'opel') & (df['model'].isna())
df.loc[opel82,['model']] = 'kadett'

other82 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] < 1988) & (df['brand'] != 'sonstige_autos') & (df['model'].isna())
df.loc[other82,['model']] = 'other'

volvo89 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] == 1989) & (df['brand'] == 'volvo') & (df['model'].isna())
df.loc[volvo89,['model']] = 'other'

audi100 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] == 1990) & (df['brand'] == 'audi') & (df['model'].isna())
df.loc[audi100,['model']] = '100'

freelander = (df['brand'] == 'land_rover') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([111,115,60,129,140,109,])) & (df['registrationyear'] > 1992) & (df['registrationyear'] < 2006) & (df['model'].isna())
df.loc[freelander,['model']] = 'freelander'

ypsilon = (df['brand'] == 'lancia') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([44,70,74,75,602,1200])) & (df['model'].isna())
df.loc[ypsilon,['model']] = 'ypsilon'

logan = (df['brand'] == 'dacia') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([75,84,85,105])) & (df['registrationyear'].isin([2009,2012,2013,2015])) & (df['model'].isna())
df.loc[logan,['model']] = 'logan'

porscheother = (df['brand'] == 'porsche') & (df['vehicletype'] == 'coupe') & (df['power'].isin([125,160])) & (df['registrationyear'].isin([1981,1989])) & (df['model'].isna())
df.loc[porscheother,['model']] = 'other'

justy = (df['brand'] == 'subaru') & (df['vehicletype'] == 'small') & (df['power'].isin([25,34,50,60,68])) & (df['registrationyear'].isin([1996,1997,2000])) & (df['model'].isna())
df.loc[justy,['model']] = 'justy'

otherrover = (df['brand'] == 'rover') & (df['vehicletype'] == 'sedan') & (df['power'].isin([75,100,111,120,150,16,77,85,105,16,77,85,105,108,116,130,174])) & (df['registrationyear'].isin([1996,1997,1998,1999,2000,2001,2002,2003])) & (df['model'].notna())
df.loc[otherrover,['model']] = 'other'

chryslerother = (df['brand'] == 'chrysler') & (df['vehicletype'] == 'sedan') & (df['power'].isin([133,254,250,85,100,109,122,137,186])) & (df['registrationyear'].isin([1952,1977,1996,1998,1999,2000,2002,2008,2010])) & (df['model'].isna())
df.loc[chryslerother,['model']] = 'other'

voyager = (df['brand'] == 'chrysler') & (df['vehicletype'] == 'bus') & (df['power'].isin([151])) & (df['registrationyear'].isin([1996,1997,1999])) & (df['model'].isna())
df.loc[voyager,['model']] = 'voyager'

t601 = (df['brand'] == 'trabant') & (df['vehicletype'] == 'sedan') & (df['power'].isin([26,45])) & (df['registrationyear'].isin([1982,1988,1989,1977,1986,1984,1998])) & (df['model'].isna())
df.loc[t601,['model']] = '601'

six = (df['brand'] == 'trabant') & (df['vehicletype'].isin(['small','coupe'])) & (df['power'].isin([60,26,75])) & (df['registrationyear'].isin([1988,1998,2004,2008])) & (df['model'].isna())
df.loc[six,['model']] = '601'

otherchevy = (df['brand'] == 'chevrolet') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([64,141,75,94,95,54,109,125,195,163,130,105,124,72,69,60,360])) & (df['registrationyear'].isin([2011,2005,1968,1978,2000,2006,2010,2012])) & (df['model'].isna())
df.loc[otherchevy,['model']] = 'other'

volvoother = (df['brand'] == 'volvo') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([115,131,52,105,113,116])) & (df['registrationyear'].isin([1996,1991,1993,2007,1985,1988,1998,1999,2004,2012])) & (df['model'].isna())
df.loc[volvoother,['model']] = 'other'

kother = (df['brand'] == 'kia') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([105,138,140,48,101,113,133,143,203])) & (df['registrationyear'].isin([2005,2007,2001,2002,2003,2004])) & (df['model'].isna())
df.loc[kother,['model']] = 'other'

rio = (df['brand'] == 'kia') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([97,109,83,98,105,125,138,139,150])) & (df['registrationyear'].isin([2003,2000,2007,1999,2001,2002])) & (df['model'].isna())
df.loc[rio,['model']] = 'rio'

sorento = (df['brand'] == 'kia') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([140,78,110,133,194])) & (df['registrationyear'].isin([2006,2001,2004,1995,1999,2012])) & (df['model'].isna())
df.loc[sorento,['model']] = 'sorento'

civic = (df['brand'] == 'honda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([90,124,125])) & (df['registrationyear'].isin([1992,1991,1993])) & (df['model'].isna())
df.loc[civic,['model']] = 'civic'

jazz = (df['brand'] == 'honda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([90])) & (df['registrationyear'].isin([2010])) & (df['model'].isna())
df.loc[jazz,['model']] = 'jazz'

hother = (df['brand'] == 'honda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([65])) & (df['registrationyear'].isin([1999])) & (df['model'].isna())
df.loc[hother,['model']] = 'other'

civcou = (df['brand'] == 'honda') & (df['vehicletype'].isin(['coupe'])) & (df['power'].isin([90,114,100,105,107,109])) & (df['registrationyear'].isin([2000,1995,1996,1998,1989,1999,2006])) & (df['model'].isna())
df.loc[civcou,['model']] = 'civic'

honother = (df['brand'] == 'honda') & (df['vehicletype'].isin(['coupe'])) & (df['power'].isin([133,185])) & (df['registrationyear'].isin([2000,1992,1998])) & (df['model'].isna())

cother = (df['brand'] == 'honda') & (df['vehicletype'].isin(['coupe'])) & (df['power'].isin([133])) & (df['registrationyear'].isin([2000,1992,1998,])) & (df['model'].isna())
df.loc[cother,['model']] = 'other'

jbus = (df['brand'] == 'honda') & (df['vehicletype'].isin(['bus'])) & (df['power'].isin([90])) & (df['registrationyear'].isin([2010,2012,2013])) & (df['model'].isna())
df.loc[jbus,['model']] = 'jazz'

octavia = (df['brand'] == 'skoda') & (df['price'] > 2099) & (df['price'] < 5701) & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([102,105,150])) & (df['registrationyear'].isin([2001,2005,2007,2008])) & (df['model'].isna())
df.loc[octavia,['model']] = octavia

swift = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([53,50,55,58,92])) & (df['registrationyear'].isin([1997,2000,1998,2003,2008])) & (df['model'].isna())
df.loc[swift,['model']] = 'swift'

suzother = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([63,52,65,76,83,84,57,96])) & (df['registrationyear'].isin([1990, 1995,1996,1999,2002,1997,2001,2004,2007])) & (df['model'].isna())
df.loc[suzother,['model']] = 'other'

ukiother = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([2009,2011,2012])) & (df['model'].isna())
df.loc[ukiother,['model']] = 'other'

jimny = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([86,82,88])) & (df['registrationyear'].isin([2001,2005,2003])) & (df['model'].isna())
df.loc[jimny,['model']] = 'jimny'

zother = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([97,45,68,75,85,98,136,170])) & (df['registrationyear'].isin([1995,1996,1988,1992,1998,2006,2007])) & (df['model'].isna())
df.loc[zother,['model']] = 'other'

carisma = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([125,115])) & (df['registrationyear'].isin([2002,1995,1998,1997,2000,2003])) & (df['model'].isna())
df.loc[carisma,['model']] = 'carisma'

colt = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75,90,95])) & (df['registrationyear'].isin([2002,2009,2000,2006])) & (df['model'].isna())
df.loc[colt,['model']] = 'colt'

coltt = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75,70,95,82,150])) & (df['registrationyear'].isin([1999,1996,1998,2006,2009,1997,2000,2002,2010,2012,2001])) & (df['model'].isna())
df.loc[coltt,['model']] = 'colt'

lancer = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([98])) & (df['registrationyear'].isin([1999,2000,2001,2003,2004,1997,2007])) & (df['model'].isna())
df.loc[lancer,['model']] = 'lancer'

galant = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([160,165])) & (df['registrationyear'].isin([1999,2000,2001,2003,2004,1997,2007])) & (df['model'].isna())
df.loc[galant,['model']] = 'galant'

wother = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([82,86,83,101,132,125])) & (df['registrationyear'].isin([1999,2000,2001,2003,2004])) & (df['model'].isna())
df.loc[wother,['model']] = 'other'

yaris = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75,86,90,87])) & (df['registrationyear'].isin([2008,2000,2001,2002])) & (df['model'].isna())
df.loc[yaris,['model']] = 'yaris'

aygo = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([2008,2006,2009])) & (df['model'].isna())
df.loc[aygo,['model']] = 'aygo'

yar = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([1999,2001])) & (df['model'].isna())
df.loc[yar,['model']] = 'yaris'

cor = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([97])) & (df['registrationyear'].isin([2003,2000,2001])) & (df['model'].isna())
df.loc[cor,['model']] = 'corolla'

corolla = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([86])) & (df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[corolla,['model']] = 'corolla'

sixty = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([1997])) & (df['model'].isna())
df.loc[sixty,['model']] = 'other'

tother = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1993,1995,1997])) & (df['model'].isna())
df.loc[tother,['model']] = 'other'

coro = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75,88,90,110])) & (df['registrationyear'].isin([1993,2006,1995,2008])) & (df['model'].isna())
df.loc[coro,['model']] = 'corolla'

auris = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([177,124,126])) & (df['registrationyear'].isin([2007,2010])) & (df['model'].isna())
df.loc[auris,['model']] = 'auris'

llo = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([72,97,105])) & (df['registrationyear'].isin([1992,2003])) & (df['model'].isna())
df.loc[llo,['model']] = 'corolla'

avensis = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([177])) & (df['model'].isna())
df.loc[avensis,['model']] = 'avensis'

sedoy = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([63,91,180])) & (df['registrationyear'].isin([1993,1998,2009])) & (df['model'].isna())
df.loc[sedoy,['model']] = 'other'

yar = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([1999])) & (df['model'].isna())
df.loc[yar,['model']] = 'yaris'

micra = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([54,65,50,55,40])) & (df['registrationyear'].isin([1994,2009,1998,1995,1999,2000,2004,1991,1996,1997,2008,2013])) & (df['model'].isna())
df.loc[micra,['model']] = 'micra'

micraa = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([65,80])) & (df['registrationyear'].isin([2003,2014])) & (df['model'].isna())
df.loc[micraa,['model']] = 'micra'

micraaa = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([98])) & (df['registrationyear'].isin([2013])) & (df['model'].isna())
df.loc[micraaa,['model']] = 'micra'

qashqai = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2011])) & (df['model'].isna())
df.loc[qashqai,['model']] = 'qashqai'

ibiza = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75,64,86,69,70,85])) & (df['registrationyear'].isin([2002,2001,2003,2011,2007])) & (df['model'].isna())
df.loc[ibiza,['model']] = 'ibiza'

arosa = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([50])) & (df['registrationyear'].isin([1999,2002,1998,2000,2001,1997])) & (df['model'].isna())
df.loc[arosa,['model']] = 'arosa'

ibizaa = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([86,101,69])) & (df['registrationyear'].isin([2006,2012,2013])) & (df['model'].isna())
df.loc[ibizaa,['model']] = 'ibiza'

ibiza1 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([200,51])) & (df['registrationyear'].isin([2009])) & (df['model'].isna())
df.loc[ibiza1,['model']] = 'ibiza'

other1 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([25])) & (df['model'].isna())
df.loc[other1,['model']] = 'other'

cordoba75 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1996,1998])) & (df['model'].isna())
df.loc[cordoba75,['model']] = 'cordoba'

leon07 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([105])) & (df['registrationyear'].isin([2007])) & (df['model'].isna())
df.loc[leon07,['model']] = 'leon'

leon160 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([102,160,265])) & (df['registrationyear'].isin([2007,2008,2009,2012])) & (df['model'].isna())
df.loc[leon160,['model']] = 'leon'

toledo = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([101,150])) & (df['registrationyear'].isin([1998,1999])) & (df['model'].isna())
df.loc[toledo,['model']] = 'toledo'

leon140 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([140])) & (df['registrationyear'].isin([2006])) & (df['model'].isna())
df.loc[leon140,['model']] = 'leon'

toledo150 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2000])) & (df['model'].isna())
df.loc[toledo150,['model']] = 'toledo'

ibiza09 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([86])) & (df['registrationyear'].isin([2009])) & (df['model'].isna())
df.loc[ibiza09,['model']] = 'ibiza'

ibiza07 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([64])) & (df['model'].isna())
df.loc[ibiza07,['model']] = 'ibiza'

getz = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55,82,88,97])) & (df['registrationyear'].isin([2003,2007,2002])) & (df['model'].isna())
df.loc[getz,['model']] = 'getz'

i_reihe = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68,77,78])) & (df['registrationyear'].isin([2010,2009,2011,2007])) & (df['model'].isna())
df.loc[i_reihe,['model']] = 'i_reihe'

getz03 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55,63,67,65,90])) & (df['registrationyear'].isin([2003])) & (df['model'].isna())
df.loc[getz03,['model']] = 'getz'

yother = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([58,54,55,60,75,40])) & (df['registrationyear'].isin([1998,1999,1996,2000,2001,2002])) & (df['model'].isna())
df.loc[yother,['model']] = 'other'

yot = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[yot,['model']] = 'other'

ir = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([67])) & (df['registrationyear'].isin([2010])) & (df['model'].isna())
df.loc[ir,['model']] = 'i_reihe'

other58 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([58])) & (df['model'].isna())
df.loc[other58,['model']] = 'other'

i = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([63,65,79,90])) & (df['registrationyear'].isin([2011])) & (df['model'].isna())
df.loc[i,['model']] = 'i_reihe'

rei = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([109,90,78])) & (df['registrationyear'].isin([2009,2010,2011])) & (df['model'].isna())
df.loc[rei,['model']] = 'i_reihe'

other99 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([82,140,160,235])) & (df['registrationyear'].isin([1999,2003,2005,2006])) & (df['model'].isna())
df.loc[other99,['model']] = 'other'

other94 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75,85,86,131,136,54])) & (df['registrationyear'].isin([1994,2000,2001,2002,2005])) & (df['model'].isna())
df.loc[other94,['model']] = 'other'

santa = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([145,155,170])) & (df['registrationyear'].isin([2003,2002,2004,2006,2008])) & (df['model'].isna())
df.loc[santa,['model']] = 'santa'

he = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([163,140,184])) & (df['registrationyear'].isin([2010,2013])) & (df['model'].isna())
df.loc[he,['model']] = 'i_reihe'

shother = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([163,99])) & (df['registrationyear'].isin([2006,1998,2000,2005])) & (df['model'].isna())
df.loc[shother,['model']] = 'other'

santa140 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([140])) & (df['registrationyear'].isin([2003])) & (df['model'].isna())
df.loc[santa140,['model']] = 'santa'

other150 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2002,2003])) & (df['model'].isna())
df.loc[other150,['model']] = 'other'

santa06 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2006])) & (df['model'].isna())
df.loc[santa06,['model']] = 'santa'

c1 = (df['brand'] == 'citroen') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([2008,2011])) & (df['model'].isna())
df.loc[c1,['model']] = 'c1'

c3 = (df['brand'] == 'citroen') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([73])) & (df['registrationyear'].isin([2003])) & (df['model'].isna())
df.loc[c3,['model']] = 'c3'

othercit = (df['brand'] == 'citroen') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60,75])) & (df['registrationyear'].isin([2001,1999,2000,1998])) & (df['model'].isna())
df.loc[othercit,['model']] = 'other'

fortwo = (df['brand'] == 'smart') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([61,45,54,41,55,71,40,50,72])) & (df['registrationyear'].isin([2005,2002,1999,2001,2000,2012,2004,2003,2008,1998,2007,2011,2009,2014])) & (df['model'].isna())
df.loc[fortwo,['model']] = 'fortwo'

forfour = (df['brand'] == 'smart') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([109])) & (df['registrationyear'].isin([2006])) & (df['model'].isna())
df.loc[forfour,['model']] = 'forfour'

ftvert = (df['brand'] == 'smart') & (df['vehicletype'].isin(['convertible'])) & (df['power'].isin([54,41,45]))  & (df['registrationyear'].isin([2000,2001,2005,2006,2008])) & (df['model'].isna())
df.loc[ftvert,['model']] = 'fortwo'

vertft = (df['brand'] == 'smart') & (df['vehicletype'].isin(['convertible'])) & (df['power'].isin([55,61]))  & (df['registrationyear'].isin([2000,2001,2002])) & (df['model'].isna())
df.loc[vertft,['model']] = 'fortwo'

ft = (df['brand'] == 'smart') & (df['vehicletype'].isin(['convertible'])) & (df['power'].isin([84]))  & (df['registrationyear'].isin([2009])) & (df['model'].isna())
df.loc[ft,['model']] = 'fortwo'

sixre = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([115,116,166,141,120])) & (df['registrationyear'].isin([1999,2003])) & (df['model'].isna())
df.loc[sixre,['model']] = '6_reihe'

sre = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([90])) & (df['registrationyear'].isin([1997,1998,1996,2000,1990])) & (df['model'].isna())
df.loc[sre,['model']] = '6_reihe'

mazother = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([144,163,109])) & (df['registrationyear'].isin([1997,2001,1993])) & (df['model'].isna())
df.loc[mazother,['model']] = 'other'

three = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([98])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[three,['model']] = '3_reihe'

three88 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([88])) & (df['registrationyear'].isin([1997,1995,1998,1996])) & (df['model'].isna())
df.loc[three88,['model']] = '3_reihe'

rh6 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([166,141,163])) & (df['registrationyear'].isin([2002,2010])) & (df['model'].isna())
df.loc[rh6,['model']] = '6_reihe'

thei = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([105,73])) & (df['registrationyear'].isin([1996,2006,1997,2005,2008])) & (df['model'].isna())
df.loc[thei,['model']] = '3_reihe'

ihth = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([144,98,114,150,109,86,75])) & (df['registrationyear'].isin([1999,1995,2003,2006,2000,2010])) & (df['model'].isna())
df.loc[ihth,['model']] = '3_reihe'

eeh = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([73])) & (df['registrationyear'].isin([1997,1996,2000])) & (df['model'].isna())
df.loc[eeh,['model']] = '3_reihe'

hee = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1997,1999])) & (df['model'].isna())
df.loc[hee,['model']] = '3_reihe'

ri3 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([88,98,65,109])) & (df['registrationyear'].isin([1998,1999,1996,2002,2003,2006,2008])) & (df['model'].isna())
df.loc[ri3,['model']] = '3_reihe'

reihe373 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([73])) & (df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[reihe373,['model']] = '3_reihe'

other7509 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([2009,2013])) & (df['model'].isna())
df.loc[other7509,['model']] = 'other'

reihe1 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1995])) & (df['model'].isna())
df.loc[reihe1,['model']] = '1_reihe'

punto60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2000.0, 2001.0, 2002.0, 2003.0, 1999.0, 1998.0,
              1997.0, 1996.0, 1993.0, 1994.0])) & (df['model'].isna())
df.loc[punto60,['model']] = 'punto'

panda60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2010.0, 2008.0, 2011.0, 1991.0])) & (df['model'].isna())
df.loc[panda60,['model']] = 'panda'

seicento60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55])) & (df['registrationyear'].isin([2000,2001])) & (df['model'].isna())
df.loc[seicento60,['model']] = 'seicento'

punto65 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([65])) & (df['registrationyear'].isin([2010.0, 2000.0, 1999.0,
              1996.0, 1998.0, 2001.0, 2003.0, 2004.0])) & (df['model'].isna())
df.loc[punto65,['model']] = 'punto'

punto01 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60, 80, 44, 75, 90, 65, 85, 64, 68, 86])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[punto01,['model']] = 'punto'

seicento01 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55,50])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[seicento01,['model']] = 'seicento'

stilo170 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([170])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[stilo170,['model']] = 'stilo'

other101 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([101])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[other101,['model']] = 'other'

punto98 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60, 86, 65, 75, 44])) & (df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[punto98,['model']] = 'punto'

five69 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([69])) & (df['registrationyear'].isin([2008.0, 2009.0, 2010.0, 2013.0])) & (df['model'].isna())
df.loc[five69,['model']] = '500'

puntorand = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([80, 86, 85, 69, 64])) & (df['registrationyear'].isin([1999,2003,2000])) & (df['model'].isna())
df.loc[puntorand,['model']] = 'punto'

stilo103 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([103, 80, 170, 115, 102])) & (df['registrationyear'].isin([2002.0, 2003.0, 2004.0, 2005.0])) & (df['model'].isna())
df.loc[stilo103,['model']] = 'stilo'

bravo150 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2007.0, 2008.0])) & (df['model'].isna())
df.loc[bravo150,['model']] = 'bravo'

bravo08 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2008.0])) & (df['model'].isna())
df.loc[bravo08,['model']] = 'bravo'

punto60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2003,2000])) & (df['model'].isna())
df.loc[punto60,['model']] = 'punto'

re2 = (df['brand'] == 'peugeot') & (df['vehicletype'].isin(['small']))& (df['power'].isin([60])) & (df['registrationyear'].isin([2004.0, 2005.0,
              2011.0, 2010.0, 1990.0])) & (df['model'].isna())
df.loc[re2,['model']] = '2_reihe'

twore = (df['brand'] == 'peugeot') & (df['vehicletype'].isin(['convertible']))& (df['power'].isin([120,109])) & (df['registrationyear'].isin([2003.0, 2002.0, 2004.0, 2005.0, 2011.0, 2012.0])) & (df['model'].isna())
df.loc[twore,['model']] = '2_reihe'

fiestarand = (df['brand'] == 'ford') & (df['vehicletype'].isin(['small']))& (df['power'].isin([82, 150,182,  61,81])) & (df['registrationyear'].isin([2006.0, 2009.0, 2014.0, 2000.0, 2005.0])) & (df['model'].isna())
df.loc[fiestarand,['model']] = 'fiesta'

fiestaa = (df['brand'] == 'ford') & (df['vehicletype'].isin(['small']))& (df['power'].isin([60])) & (df['registrationyear'].isin([1992.0, 2010.0])) & (df['model'].isna())
df.loc[fiestaa,['model']] = 'fiesta'

fiestaaa = (df['brand'] == 'ford') & (df['vehicletype'].isin(['small']))& (df['power'].isin([50, 75, 54, 66, 103])) & (df['registrationyear'].isin([2000])) & (df['model'].isna())
df.loc[fiestaaa,['model']] = 'fiesta'
In [75]:
del brand_power1
del brand_power2
del brand_power3
del brand_power4
del brand_power5
del brand_power6
del top5_brand_power
del over1000_brand_power
del under1000_brand_power
del top5_brands 
del top5_brands2
del top5_brands3
del top5_brands4
del top5_brands5
del top5_brands6
del middle_brands
del middle_brands2
del middle_brands3
del middle_brands4
del middle_brands5
del middle_brands6
del lower_brands
del lower_brands2
del lower_brands3
del lower_brands4
del lower_brands5
del lower_brands6
del top5
del top52
del top53
del top54
del top55
del top56
del middle
del middle2
del middle3
del middle4
del middle5
del middle6
del lower
del lower2
del lower3
del lower4
del lower5
del lower6
In [76]:
def fill_missing_models(df):
    df = df.copy()

    # Split data into known and missing model subsets
    known = df[df['model'].notna()]
    missing = df[df['model'].isna()]

    # --- Step 1: Keep only combinations that map to exactly one model ---
    unique_models = (
        known.groupby(['brand', 'vehicletype', 'power', 'registrationyear'])['model']
        .nunique()
        .reset_index(name='model_count')
    )

    # Only combos with one unique model (avoid ambiguous mappings)
    unique_keys = unique_models[unique_models['model_count'] == 1].drop(columns='model_count')

    # Merge these unique combos with their actual model name
    unique_known = known.merge(unique_keys, on=['brand', 'vehicletype', 'power', 'registrationyear'])
    unique_known = unique_known[['brand', 'vehicletype', 'power', 'registrationyear', 'model']].drop_duplicates()

    # --- Step 2: Merge to fill missing models safely ---
    filled = missing.merge(
        unique_known,
        on=['brand', 'vehicletype', 'power', 'registrationyear'],
        how='left',
        suffixes=('', '_known')
    )

    # Fill in model from the unique match
    filled['model'] = filled['model_known'].combine_first(filled['model'])
    filled = filled.drop(columns=['model_known'])

    # --- Step 3: Combine back with known data ---
    result = pd.concat([known, filled], ignore_index=True)

    return result
In [77]:
df_new = df.copy()
df_new = fill_missing_models(df_new)

df_new.isna().sum()
Out[77]:
datecrawled                    0
price                          0
vehicletype                37471
registrationyear               0
gearbox                    19830
power                          0
model                      15662
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
In [78]:
df_new.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354107 entries, 0 to 354106
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   datecrawled              354107 non-null  object        
 1   price                    354107 non-null  int64         
 2   vehicletype              316636 non-null  object        
 3   registrationyear         354107 non-null  float64       
 4   gearbox                  334277 non-null  object        
 5   power                    354107 non-null  int64         
 6   model                    338445 non-null  object        
 7   mileage                  354107 non-null  int64         
 8   registrationmonth        354107 non-null  int64         
 9   fueltype                 321218 non-null  object        
 10  brand                    354107 non-null  object        
 11  notrepaired              282962 non-null  object        
 12  datecreated              354107 non-null  datetime64[ns]
 13  numberofpictures         354107 non-null  int64         
 14  postalcode               354107 non-null  int64         
 15  lastseen                 354107 non-null  object        
 16  registration_correction  354107 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(6), object(9)
memory usage: 45.9+ MB
In [79]:
def analyze_missing_models(df, brand):
    # Focus on the brand
    brand_df = df[df['brand'] == brand]
    
    # Step 1: Check which vehicle types are most common for missing models
    vt_counts = brand_df[brand_df['model'].isna()]['vehicletype'].value_counts()
    print(f"\n--- {brand.upper()} ---")
    print("Vehicle types with missing models:")
    print(vt_counts)

    # Step 2: For each vehicle type, show power distribution
    for vt in vt_counts.index:
        subset = brand_df[(brand_df['model'].isna()) & (brand_df['vehicletype'] == vt)]
        pw_counts = subset['power'].value_counts()
        print(f"\n{vt}: Power distribution for missing models")
        print(pw_counts)
        print(pw_counts.index)

        # Step 3: Show registration year distribution
        reg_counts = subset['registrationyear'].value_counts()
        print(f"\n{vt}: Registration year distribution for missing models")
        print(reg_counts)
        print(reg_counts.index)

analyze_missing_models(df_new, 'ford')
--- FORD ---
Vehicle types with missing models:
small          198
wagon          101
sedan           69
bus             43
coupe           30
suv             13
other           13
convertible      9
Name: vehicletype, dtype: int64

small: Power distribution for missing models
0      67
60     60
50     16
75     16
90      5
80      5
55      4
44      3
70      3
45      3
68      2
65      2
100     2
69      1
67      1
71      1
74      1
59      1
95      1
96      1
110     1
116     1
118     1
Name: power, dtype: int64
Int64Index([  0,  60,  50,  75,  90,  80,  55,  44,  70,  45,  68,  65, 100,
             69,  67,  71,  74,  59,  95,  96, 110, 116, 118],
           dtype='int64')

small: Registration year distribution for missing models
1999.0    30
1998.0    25
1997.0    23
2000.0    21
2002.0    21
2001.0    17
2004.0    12
1996.0    11
2003.0    11
2005.0     8
2006.0     5
1990.0     5
2007.0     3
1978.0     1
2014.0     1
2009.0     1
2008.0     1
1994.0     1
1992.0     1
Name: registrationyear, dtype: int64
Float64Index([1999.0, 1998.0, 1997.0, 2000.0, 2002.0, 2001.0, 2004.0, 1996.0,
              2003.0, 2005.0, 2006.0, 1990.0, 2007.0, 1978.0, 2014.0, 2009.0,
              2008.0, 1994.0, 1992.0],
             dtype='float64')

wagon: Power distribution for missing models
0      36
115    14
116    11
90      9
131     7
109     4
100     4
120     3
75      3
117     2
105     1
128     1
89      1
60      1
170     1
150     1
140     1
125     1
Name: power, dtype: int64
Int64Index([  0, 115, 116,  90, 131, 109, 100, 120,  75, 117, 105, 128,  89,
             60, 170, 150, 140, 125],
           dtype='int64')

wagon: Registration year distribution for missing models
2001.0    13
1998.0    12
2000.0    12
1999.0    12
2005.0    11
2002.0     9
2003.0     8
2004.0     7
1997.0     4
1996.0     4
2006.0     3
2008.0     2
2007.0     2
1995.0     1
1990.0     1
Name: registrationyear, dtype: int64
Float64Index([2001.0, 1998.0, 2000.0, 1999.0, 2005.0, 2002.0, 2003.0, 2004.0,
              1997.0, 1996.0, 2006.0, 2008.0, 2007.0, 1995.0, 1990.0],
             dtype='float64')

sedan: Power distribution for missing models
0       19
75       7
90       7
116      5
115      4
110      3
95       2
50       2
77       2
226      2
148      1
109      1
1002     1
105      1
230      1
1120     1
94       1
29       1
136      1
131      1
85       1
145      1
66       1
170      1
38       1
89       1
Name: power, dtype: int64
Int64Index([   0,   75,   90,  116,  115,  110,   95,   50,   77,  226,  148,
             109, 1002,  105,  230, 1120,   94,   29,  136,  131,   85,  145,
              66,  170,   38,   89],
           dtype='int64')

sedan: Registration year distribution for missing models
1998.0    10
1999.0     8
1997.0     6
2000.0     5
1996.0     5
2001.0     4
2002.0     4
2006.0     3
2005.0     3
1995.0     2
1993.0     2
2009.0     2
1989.0     2
2013.0     1
1976.0     1
1960.0     1
1940.0     1
1970.0     1
2007.0     1
1994.0     1
1977.0     1
2004.0     1
1988.0     1
1978.0     1
1979.0     1
1967.0     1
Name: registrationyear, dtype: int64
Float64Index([1998.0, 1999.0, 1997.0, 2000.0, 1996.0, 2001.0, 2002.0, 2006.0,
              2005.0, 1995.0, 1993.0, 2009.0, 1989.0, 2013.0, 1976.0, 1960.0,
              1940.0, 1970.0, 2007.0, 1994.0, 1977.0, 2004.0, 1988.0, 1978.0,
              1979.0, 1967.0],
             dtype='float64')

bus: Power distribution for missing models
0      15
116     6
125     4
75      2
140     2
90      2
131     2
135     1
147     1
80      1
130     1
146     1
211     1
128     1
98      1
175     1
145     1
Name: power, dtype: int64
Int64Index([0, 116, 125, 75, 140, 90, 131, 135, 147, 80, 130, 146, 211, 128,
            98, 175, 145],
           dtype='int64')

bus: Registration year distribution for missing models
2005.0    8
2001.0    5
2009.0    4
1998.0    4
2006.0    3
2003.0    3
1999.0    3
2008.0    2
2007.0    2
1997.0    2
1996.0    2
2000.0    2
1993.0    1
1992.0    1
2004.0    1
Name: registrationyear, dtype: int64
Float64Index([2005.0, 2001.0, 2009.0, 1998.0, 2006.0, 2003.0, 1999.0, 2008.0,
              2007.0, 1997.0, 1996.0, 2000.0, 1993.0, 1992.0, 2004.0],
             dtype='float64')

coupe: Power distribution for missing models
0      10
130     5
131     2
90      2
100     1
69      1
136     1
138     1
140     1
132     1
179     1
145     1
120     1
122     1
125     1
Name: power, dtype: int64
Int64Index([0, 130, 131, 90, 100, 69, 136, 138, 140, 132, 179, 145, 120, 122,
            125],
           dtype='int64')

coupe: Registration year distribution for missing models
2002.0    7
2000.0    7
2001.0    3
1999.0    3
1995.0    2
2006.0    2
1998.0    2
1980.0    1
1997.0    1
1978.0    1
2009.0    1
Name: registrationyear, dtype: int64
Float64Index([2002.0, 2000.0, 2001.0, 1999.0, 1995.0, 2006.0, 1998.0, 1980.0,
              1997.0, 1978.0, 2009.0],
             dtype='float64')

suv: Power distribution for missing models
124    4
0      3
150    1
340    1
165    1
196    1
203    1
125    1
Name: power, dtype: int64
Int64Index([124, 0, 150, 340, 165, 196, 203, 125], dtype='int64')

suv: Registration year distribution for missing models
1994.0    5
2009.0    1
1987.0    1
2001.0    1
2003.0    1
1977.0    1
2004.0    1
1989.0    1
2006.0    1
Name: registrationyear, dtype: int64
Float64Index([1994.0, 2009.0, 1987.0, 2001.0, 2003.0, 1977.0, 2004.0, 1989.0,
              2006.0],
             dtype='float64')

other: Power distribution for missing models
0      3
157    2
226    1
70     1
80     1
205    1
240    1
115    1
109    1
175    1
Name: power, dtype: int64
Int64Index([0, 157, 226, 70, 80, 205, 240, 115, 109, 175], dtype='int64')

other: Registration year distribution for missing models
1993.0    2
1984.0    2
1964.0    1
2008.0    1
1959.0    1
2001.0    1
1953.0    1
1996.0    1
2000.0    1
2005.0    1
2006.0    1
Name: registrationyear, dtype: int64
Float64Index([1993.0, 1984.0, 1964.0, 2008.0, 1959.0, 2001.0, 1953.0, 1996.0,
              2000.0, 2005.0, 2006.0],
             dtype='float64')

convertible: Power distribution for missing models
95     3
90     2
0      1
116    1
70     1
190    1
Name: power, dtype: int64
Int64Index([95, 90, 0, 116, 70, 190], dtype='int64')

convertible: Registration year distribution for missing models
2004.0    4
1997.0    1
2003.0    1
1999.0    1
1996.0    1
1992.0    1
Name: registrationyear, dtype: int64
Float64Index([2004.0, 1997.0, 2003.0, 1999.0, 1996.0, 1992.0], dtype='float64')
In [80]:
def analyze_missing_models(df, brand):
    # Focus on the brand
    brand_df = df[df['brand'] == brand]
    
    # Step 1: Check which vehicle types are most common for missing models
    vt_counts = brand_df[brand_df['model'].isna()]['vehicletype'].value_counts()
    print(f"\n--- {brand.upper()} ---")
    print("Vehicle types with missing models:")
    print(vt_counts)

    # Step 2: For each vehicle type, show power distribution
    for vt in vt_counts.index:
        subset = brand_df[(brand_df['model'].isna()) & (brand_df['vehicletype'] == vt)]
        pw_counts = subset['power'].value_counts()
        print(f"\n{vt}: Power distribution for missing models")
        print(pw_counts)
        print(pw_counts.index)

        # Step 3: Show registration year distribution
        reg_counts = subset['registrationyear'].value_counts()
        print(f"\n{vt}: Registration year distribution for missing models")
        print(reg_counts)
        print(reg_counts.index)

analyze_missing_models(df_new, 'mercedes_benz')
--- MERCEDES_BENZ ---
Vehicle types with missing models:
sedan          315
wagon          136
coupe           64
bus             37
convertible     24
other           16
suv             15
small           14
Name: vehicletype, dtype: int64

sedan: Power distribution for missing models
0        93
136      23
122      21
170      16
150      15
224      12
143      11
204      10
163      10
75        8
109       8
160       6
306       6
125       6
184       5
177       5
193       4
116       4
118       3
102       3
197       3
95        3
132       3
108       2
190       2
220       2
90        2
87        2
65        2
272       2
265       1
234       1
300       1
278       1
218       1
387       1
388       1
156       1
186       1
16051     1
174       1
166       1
161       1
52        1
142       1
140       1
123       1
110       1
107       1
103       1
88        1
86        1
10912     1
Name: power, dtype: int64
Int64Index([    0,   136,   122,   170,   150,   224,   143,   204,   163,
               75,   109,   160,   306,   125,   184,   177,   193,   116,
              118,   102,   197,    95,   132,   108,   190,   220,    90,
               87,    65,   272,   265,   234,   300,   278,   218,   387,
              388,   156,   186, 16051,   174,   166,   161,    52,   142,
              140,   123,   110,   107,   103,    88,    86, 10912],
           dtype='int64')

sedan: Registration year distribution for missing models
2002.0    26
1999.0    20
2000.0    20
1996.0    19
2001.0    19
2003.0    17
1992.0    17
1990.0    13
1997.0    13
1998.0    13
1991.0    13
2005.0    12
2007.0    11
1989.0    10
2008.0    10
1995.0    10
2006.0     9
1993.0     8
2004.0     7
1982.0     7
1994.0     7
1987.0     5
1986.0     5
1988.0     4
2010.0     4
1983.0     4
1985.0     2
1981.0     2
1968.0     1
1967.0     1
1974.0     1
1966.0     1
2012.0     1
1976.0     1
2009.0     1
1971.0     1
Name: registrationyear, dtype: int64
Float64Index([2002.0, 1999.0, 2000.0, 1996.0, 2001.0, 2003.0, 1992.0, 1990.0,
              1997.0, 1998.0, 1991.0, 2005.0, 2007.0, 1989.0, 2008.0, 1995.0,
              2006.0, 1993.0, 2004.0, 1982.0, 1994.0, 1987.0, 1986.0, 1988.0,
              2010.0, 1983.0, 1985.0, 1981.0, 1968.0, 1967.0, 1974.0, 1966.0,
              2012.0, 1976.0, 2009.0, 1971.0],
             dtype='float64')

wagon: Power distribution for missing models
0      34
150    16
122    14
170    11
136     8
163     8
116     7
204     5
143     5
125     5
90      4
224     4
130     2
177     2
193     2
132     1
272     1
115     1
102     1
165     1
184     1
196     1
280     1
205     1
Name: power, dtype: int64
Int64Index([  0, 150, 122, 170, 136, 163, 116, 204, 143, 125,  90, 224, 130,
            177, 193, 132, 272, 115, 102, 165, 184, 196, 280, 205],
           dtype='int64')

wagon: Registration year distribution for missing models
1997.0    17
1998.0    15
2003.0    14
2002.0    12
2008.0    10
2000.0     8
1999.0     8
2001.0     7
1996.0     6
2004.0     6
2006.0     5
1989.0     5
2005.0     4
2010.0     4
1994.0     3
1993.0     3
1992.0     2
1991.0     2
2007.0     2
1995.0     2
2009.0     1
Name: registrationyear, dtype: int64
Float64Index([1997.0, 1998.0, 2003.0, 2002.0, 2008.0, 2000.0, 1999.0, 2001.0,
              1996.0, 2004.0, 2006.0, 1989.0, 2005.0, 2010.0, 1994.0, 1993.0,
              1992.0, 1991.0, 2007.0, 1995.0, 2009.0],
             dtype='float64')

coupe: Power distribution for missing models
0      9
163    8
136    5
306    5
170    5
197    4
200    3
272    3
192    2
305    2
109    2
231    2
224    2
218    2
143    2
132    2
186    1
193    1
150    1
208    1
500    1
122    1
Name: power, dtype: int64
Int64Index([  0, 163, 136, 306, 170, 197, 200, 272, 192, 305, 109, 231, 224,
            218, 143, 132, 186, 193, 150, 208, 500, 122],
           dtype='int64')

coupe: Registration year distribution for missing models
2002.0    13
2000.0     6
2004.0     5
2001.0     5
2005.0     4
2006.0     4
2003.0     3
1999.0     3
1982.0     3
1998.0     3
1978.0     2
2007.0     2
1988.0     2
1991.0     1
2008.0     1
1997.0     1
2010.0     1
1972.0     1
1995.0     1
1990.0     1
1992.0     1
1984.0     1
Name: registrationyear, dtype: int64
Float64Index([2002.0, 2000.0, 2004.0, 2001.0, 2005.0, 2006.0, 2003.0, 1999.0,
              1982.0, 1998.0, 1978.0, 2007.0, 1988.0, 1991.0, 2008.0, 1997.0,
              2010.0, 1972.0, 1995.0, 1990.0, 1992.0, 1984.0],
             dtype='float64')

bus: Power distribution for missing models
0      14
122     5
150     4
70      2
129     1
130     1
200     1
85      1
90      1
156     1
95      1
100     1
109     1
110     1
116     1
55      1
Name: power, dtype: int64
Int64Index([0, 122, 150, 70, 129, 130, 200, 85, 90, 156, 95, 100, 109, 110,
            116, 55],
           dtype='int64')

bus: Registration year distribution for missing models
2001.0    7
2002.0    5
2006.0    4
2008.0    3
2007.0    3
2000.0    3
1994.0    2
2005.0    2
2004.0    2
2009.0    2
2003.0    2
1998.0    1
1999.0    1
Name: registrationyear, dtype: int64
Float64Index([2001.0, 2002.0, 2006.0, 2008.0, 2007.0, 2000.0, 1994.0, 2005.0,
              2004.0, 2009.0, 2003.0, 1998.0, 1999.0],
             dtype='float64')

convertible: Power distribution for missing models
0      6
163    4
326    3
136    2
170    2
193    1
231    1
198    1
240    1
218    1
220    1
168    1
Name: power, dtype: int64
Int64Index([0, 163, 326, 136, 170, 193, 231, 198, 240, 218, 220, 168], dtype='int64')

convertible: Registration year distribution for missing models
2004.0    5
2001.0    3
1992.0    3
2000.0    3
2007.0    2
2002.0    2
1984.0    1
1993.0    1
1968.0    1
1960.0    1
1998.0    1
2005.0    1
Name: registrationyear, dtype: int64
Float64Index([2004.0, 2001.0, 1992.0, 2000.0, 2007.0, 2002.0, 1984.0, 1993.0,
              1968.0, 1960.0, 1998.0, 2005.0],
             dtype='float64')

other: Power distribution for missing models
0      8
75     2
129    1
99     1
116    1
72     1
90     1
79     1
Name: power, dtype: int64
Int64Index([0, 75, 129, 99, 116, 72, 90, 79], dtype='int64')

other: Registration year distribution for missing models
2001.0    2
2006.0    1
1999.0    1
2016.0    1
1992.0    1
2013.0    1
1981.0    1
2007.0    1
1971.0    1
1997.0    1
1988.0    1
1983.0    1
2000.0    1
1993.0    1
1991.0    1
Name: registrationyear, dtype: int64
Float64Index([2001.0, 2006.0, 1999.0, 2016.0, 1992.0, 2013.0, 1981.0, 2007.0,
              1971.0, 1997.0, 1988.0, 1983.0, 2000.0, 1993.0, 1991.0],
             dtype='float64')

suv: Power distribution for missing models
0      3
190    3
163    2
165    2
224    2
150    1
167    1
250    1
Name: power, dtype: int64
Int64Index([0, 190, 163, 165, 224, 150, 167, 250], dtype='int64')

suv: Registration year distribution for missing models
2007.0    4
2001.0    2
2000.0    2
2008.0    1
1998.0    1
1989.0    1
2002.0    1
2003.0    1
2005.0    1
2006.0    1
Name: registrationyear, dtype: int64
Float64Index([2007.0, 2001.0, 2000.0, 2008.0, 1998.0, 1989.0, 2002.0, 2003.0,
              2005.0, 2006.0],
             dtype='float64')

small: Power distribution for missing models
0      6
75     2
108    2
74     1
90     1
125    1
62     1
Name: power, dtype: int64
Int64Index([0, 75, 108, 74, 90, 125, 62], dtype='int64')

small: Registration year distribution for missing models
2000.0    5
1998.0    2
2004.0    2
2006.0    2
2008.0    1
1990.0    1
2002.0    1
Name: registrationyear, dtype: int64
Float64Index([2000.0, 1998.0, 2004.0, 2006.0, 2008.0, 1990.0, 2002.0], dtype='float64')
In [81]:
df_new[(df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([102,82,88])) & \
    (df_new['model'].isna())].value_counts(subset = 'registrationyear').index
Out[81]:
Float64Index([2000.0, 1989.0, 1999.0], dtype='float64', name='registrationyear')
In [82]:
df_new[(df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['registrationyear'].isin([2000])) & (df_new['model'].isna())].value_counts(subset = 'power').index
Out[82]:
Int64Index([0, 163, 143, 170, 88, 102, 116, 160, 197, 265, 306], dtype='int64', name='power')
In [83]:
df_new[(df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([102,82,88])) & \
    (df_new['registrationyear'].isin([2000])) & (df_new['model'].notna())].value_counts(subset = 'model')
Out[83]:
model
a_klasse    138
c_klasse      6
e_klasse      1
dtype: int64
In [84]:
c = (df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([122])) & (df_new['registrationyear'].isin([1996.0, 1994.0, 1995.0, 1997.0, 1998.0, 1999.0])) & (df_new['model'].isna())
df_new.loc[c,['model']] = 'c_klasse'

e = (df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([150, 177,306])) & (df_new['registrationyear'].isin([2002])) & (df_new['model'].isna())
df_new.loc[e,['model']] = 'e_klasse'

a = (df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([102,82,88])) & (df_new['registrationyear'].isin([2000])) & (df_new['model'].isna())
df_new.loc[a,['model']] = 'a_klasse'
In [85]:
def fill_missing_models_majority(df, threshold=0.9):
    df = df.copy()

    # Split known vs missing
    known = df[df['model'].notna()]
    missing = df[df['model'].isna()]

    # Step 1: compute dominant model per combo and its proportion
    model_stats = (
        known.groupby(['brand', 'vehicletype', 'power', 'registrationyear', 'model'])
        .size()
        .groupby(level=[0, 1, 2, 3])
        .apply(lambda x: x / x.sum())  # convert to proportions
        .reset_index(name='model_share')
    )

    # Step 2: keep only those combos where a single model dominates (≥ threshold)
    dominant = (
        model_stats[model_stats['model_share'] >= threshold]
        .sort_values('model_share', ascending=False)
        .drop_duplicates(subset=['brand', 'vehicletype', 'power', 'registrationyear'])
    )

    # Step 3: merge and fill
    filled = missing.merge(
        dominant[['brand', 'vehicletype', 'power', 'registrationyear', 'model']],
        on=['brand', 'vehicletype', 'power', 'registrationyear'],
        how='left',
        suffixes=('', '_pred')
    )

    filled['model'] = filled['model_pred'].combine_first(filled['model'])
    filled = filled.drop(columns=['model_pred'])

    # Step 4: combine back with known
    result = pd.concat([known, filled], ignore_index=True)

    return result
In [86]:
df_new.isna().sum()
Out[86]:
datecrawled                    0
price                          0
vehicletype                37471
registrationyear               0
gearbox                    19830
power                          0
model                      15644
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
In [87]:
df_newer = fill_missing_models_majority(df_new, threshold = 0.9)
df_newer.isna().sum()
Out[87]:
datecrawled                    0
price                          0
vehicletype                37471
registrationyear               0
gearbox                    19830
power                          0
model                      14878
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
In [88]:
def fill_missing_vehicletype(df, threshold=0.9):
    df = df.copy()
    
    # Split known vs missing
    known = df[df['vehicletype'].notna()]
    missing = df[df['vehicletype'].isna()]
    
    if missing.empty:
        return df  # nothing to fill
    
    # Step 1: compute dominant vehicletype per group
    vehicletype_stats = (
        known.groupby(['brand', 'model', 'power', 'registrationyear'])['vehicletype']
        .value_counts(normalize=True)  # fraction per type
        .rename('fraction')
        .reset_index()
    )
    
    # Keep only dominant types above threshold
    dominant_types = (
        vehicletype_stats[vehicletype_stats['fraction'] >= threshold]
        .sort_values('fraction', ascending=False)
        .drop_duplicates(subset=['brand', 'model', 'power', 'registrationyear'])
    )
    
    # Step 2: merge dominant types into missing
    missing_filled = missing.merge(
        dominant_types[['brand','model','power','registrationyear','vehicletype']],
        on=['brand','model','power','registrationyear'],
        how='left',
        suffixes=('', '_pred')
    )
    
    # Fill in vehicletype from dominant type
    missing_filled['vehicletype'] = missing_filled['vehicletype_pred'].combine_first(missing_filled['vehicletype'])
    missing_filled = missing_filled.drop(columns=['vehicletype_pred'])
    
    # Step 3: combine back with known
    result = pd.concat([known, missing_filled], ignore_index=True)
    
    return result
In [89]:
df_newest = fill_missing_vehicletype(df_newer)
df_newest.isna().sum()
Out[89]:
datecrawled                    0
price                          0
vehicletype                33013
registrationyear               0
gearbox                    19830
power                          0
model                      14878
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
In [90]:
def fill_zero_power(df, group_cols=None, threshold=0.9):
    """
    Fill zero horsepower (HP) values using mode-based imputation with a confidence threshold.

    group_cols : list of str, optional
        Columns to group by when determining mode HP. 
        Default: ['brand', 'model', 'fueltype', 'registrationyear']

    returns df : 
        DataFrame with zero HP values filled where confident mode exists.
    """

    # Default grouping columns
    if group_cols is None:
        group_cols = ['brand', 'model', 'vehicletype','fueltype', 'registrationyear']

    df = df.copy()  # Work on a copy to avoid side effects

    # Step 1: Compute mode HP for each group
    hp_mode_stats = (
        df[df['power'] > 0]  # Only consider valid HPs
        .groupby(group_cols)['power']
        .agg(lambda x: x.mode()[0] if not x.mode().empty else None)
        .reset_index(name='mode_hp')
    )

    # Step 2: Compute mode frequency (confidence)
    hp_freq_stats = (
        df[df['power'] > 0]
        .groupby(group_cols)['power']
        .value_counts(normalize=True)
        .groupby(level=list(range(len(group_cols))))  # Group again by same keys
        .max()
        .reset_index(name='mode_freq')
    )

    # Step 3: Keep only groups where mode occurs ≥ threshold fraction of the time
    hp_stats = pd.merge(hp_mode_stats, hp_freq_stats, on=group_cols)
    hp_stats = hp_stats[hp_stats['mode_freq'] >= threshold]

    # Step 4: Merge imputation info back to df
    df = df.merge(hp_stats, on=group_cols, how='left')

    # Step 5: Fill zeros only where confident mode exists
    df['power'] = df.apply(
        lambda row: row['mode_hp'] if row['power'] == 0 and pd.notna(row['mode_hp']) else row['power'],
        axis=1
    )

    # Step 6: Clean up helper columns
    df = df.drop(columns=['mode_hp', 'mode_freq'], errors='ignore')

    return df
In [91]:
df_car = fill_zero_power(df_newest)
df_car.isna().sum()
Out[91]:
datecrawled                    0
price                          0
vehicletype                33013
registrationyear               0
gearbox                    19830
power                          0
model                      14878
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
In [92]:
def fill_missing_fueltype(df, group_cols=None, threshold=0.9):

    if group_cols is None:
        group_cols = ['brand', 'model', 'power', 'vehicletype', 'registrationyear']

    df = df.copy()

    # Step 1: Compute mode fueltype per group
    fuel_mode_stats = (
        df[df['fueltype'].notna()]
        .groupby(group_cols)['fueltype']
        .agg(lambda x: x.mode()[0] if not x.mode().empty else None)
        .reset_index(name='mode_fueltype')
    )

    # Step 2: Compute how dominant (confident) that mode is
    fuel_freq_stats = (
        df[df['fueltype'].notna()]
        .groupby(group_cols)['fueltype']
        .value_counts(normalize=True)
        .groupby(level=list(range(len(group_cols))))
        .max()
        .reset_index(name='mode_freq')
    )

    # Step 3: Keep only groups with strong mode agreement
    fuel_stats = pd.merge(fuel_mode_stats, fuel_freq_stats, on=group_cols)
    fuel_stats = fuel_stats[fuel_stats['mode_freq'] >= threshold]

    # Step 4: Merge back and fill missing
    df = df.merge(fuel_stats, on=group_cols, how='left')

    df['fueltype'] = df.apply(
        lambda row: row['mode_fueltype'] if pd.isna(row['fueltype']) and pd.notna(row['mode_fueltype'])
        else row['fueltype'],
        axis=1
    )

    # Step 5: Clean up helper columns
    df = df.drop(columns=['mode_fueltype', 'mode_freq'], errors='ignore')

    return df
In [93]:
df_ft = fill_missing_fueltype(df_car)
df_ft.isna().sum()
Out[93]:
datecrawled                    0
price                          0
vehicletype                33013
registrationyear               0
gearbox                    19830
power                          0
model                      14878
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
In [94]:
def fill_missing_models_majority_new(df, threshold=0.9):

    df = df.copy()
    
    # --- Step 1: Bin registration year into your custom ranges ---
    def categorize_year(year):
        if pd.isna(year):
            return np.nan
        elif year < 1990:
            return 'before_1990'
        elif year < 2000:
            return '1990s'
        elif year < 2010:
            return '2000s'
        else:
            return '2010_plus'

    df['year_bin'] = df['registrationyear'].apply(categorize_year)

    # --- Step 2: Split known and missing models ---
    known = df[df['model'].notna()]
    missing = df[df['model'].isna()]

    # --- Step 3: Compute majority model per group ---
    model_counts = (
        known.groupby(['brand', 'vehicletype', 'year_bin'])['model']
        .value_counts(normalize=True)
        .rename('freq')
        .reset_index()
    )

    # Keep only models that dominate a group above the threshold (e.g., 90%)
    majority_models = model_counts[model_counts['freq'] >= threshold]

    # --- Step 4: Merge and fill missing models ---
    filled = missing.merge(
        majority_models[['brand', 'vehicletype', 'year_bin', 'model']],
        on=['brand', 'vehicletype', 'year_bin'],
        how='left',
        suffixes=('', '_majority')
    )

    # Fill missing model where a confident majority exists
    filled['model'] = filled['model_majority'].combine_first(filled['model'])
    filled.drop(columns=['model_majority'], inplace=True)

    # --- Step 5: Combine back ---
    result = pd.concat([known, filled], ignore_index=True)

    return result
In [95]:
df_model = fill_missing_models_majority_new(df_ft)
df_model.isna().sum()
Out[95]:
datecrawled                    0
price                          0
vehicletype                33013
registrationyear               0
gearbox                    19830
power                          0
model                      14455
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
In [96]:
def fill_missing_vehicle_type_majority(df, threshold=0.9):
    df = df.copy()

    # --- Step 1: Split into known and missing ---
    known = df[df['vehicletype'].notna()]
    missing = df[df['vehicletype'].isna()]

    # --- Step 2: Compute majority vehicletype per group ---
    vt_counts = (
        known.groupby(['brand', 'model', 'year_bin'])['vehicletype']
        .value_counts(normalize=True)
        .rename('freq')
        .reset_index()
    )

    majority_types = vt_counts[vt_counts['freq'] >= threshold]

    # --- Step 3: Merge and fill ---
    filled = missing.merge(
        majority_types[['brand', 'model', 'year_bin', 'vehicletype']],
        on=['brand', 'model', 'year_bin'],
        how='left',
        suffixes=('', '_majority')
    )

    filled['vehicletype'] = filled['vehicletype_majority'].combine_first(filled['vehicletype'])
    filled.drop(columns=['vehicletype_majority'], inplace=True)

    # --- Step 4: Combine back ---
    result = pd.concat([known, filled], ignore_index=True)

    return result
In [97]:
df_vt = fill_missing_vehicle_type_majority(df_model)
df_vt.isna().sum()
Out[97]:
datecrawled                    0
price                          0
vehicletype                24944
registrationyear               0
gearbox                    19830
power                          0
model                      14455
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
In [98]:
def fill_missing_models_majority_x(df, threshold=0.9):

    df = df.copy()
    missing_before = df['model'].isna().sum()

    # Define tiered grouping strategies from broad → narrow
    groupings = [
        ['brand', 'vehicletype'],
        ['brand', 'vehicletype', 'year_bin'],
        ['brand', 'fueltype', 'vehicletype'],
        ['brand', 'vehicletype', 'fueltype', 'year_bin']
    ]

    # Iterate through groupings
    for cols in groupings:
        majority_model = (
            df.groupby(cols)['model']
            .agg(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else np.nan)
        )
        counts = df.groupby(cols)['model'].value_counts(normalize=True).groupby(cols).max()
        majority_model = majority_model[counts.reindex(majority_model.index).fillna(False) >= threshold]
        df['model'] = df.apply(
            lambda row: majority_model.get(tuple(row[c] for c in cols), row['model'])
            if pd.isna(row['model'])
            else row['model'],
            axis=1
        )

    missing_after = df['model'].isna().sum()
    filled = missing_before - missing_after
    print(f"✅ Filled {filled} missing models (threshold={threshold:.0%})")
    
    return df
In [99]:
df_model_x = fill_missing_models_majority_x(df_vt)
✅ Filled 112 missing models (threshold=90%)
In [100]:
df_model_x.isna().sum()
Out[100]:
datecrawled                    0
price                          0
vehicletype                24944
registrationyear               0
gearbox                    19830
power                          0
model                      14343
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
In [101]:
def fill_missing_vehicle_type(df, threshold=0.9):
    """
    Fills missing vehicle types based on the most common value within:
      1. (model, brand, year_bin, power)
      2. (model, brand, year_bin)
      3. (model, brand)
    Only fills when the confidence (frequency ratio of the mode) 
    is above the given threshold.
    """
    df = df.copy()

    def safe_mode(series):
        m = series.mode(dropna=True)
        return m.iloc[0] if not m.empty else np.nan

    # -----------------------------
    # STEP 1: (model, brand, year_bin, power)
    # -----------------------------
    group_cols_detailed = ['model', 'brand', 'year_bin', 'power']
    grouped_detailed = df.groupby(group_cols_detailed)['vehicletype']

    majority_detailed = grouped_detailed.apply(safe_mode)
    confidence_detailed = grouped_detailed.apply(
        lambda x: x.value_counts(normalize=True).iloc[0] if not x.dropna().empty else 0
    )

    majority_detailed = majority_detailed[confidence_detailed >= threshold]
    majority_detailed = majority_detailed.rename('majority_type').reset_index()

    df = df.merge(majority_detailed, on=group_cols_detailed, how='left')

    # -----------------------------
    # STEP 2: (model, brand, year_bin)
    # -----------------------------
    missing_mask = df['vehicletype'].isna() & df['majority_type'].isna()
    group_cols_simple = ['model', 'brand', 'year_bin']
    grouped_simple = df.groupby(group_cols_simple)['vehicletype']

    majority_simple = grouped_simple.apply(safe_mode)
    confidence_simple = grouped_simple.apply(
        lambda x: x.value_counts(normalize=True).iloc[0] if not x.dropna().empty else 0
    )

    majority_simple = majority_simple[confidence_simple >= threshold]
    majority_simple = majority_simple.rename('fallback_type').reset_index()

    df = df.merge(majority_simple, on=group_cols_simple, how='left')

    # -----------------------------
    # STEP 3: (model, brand)
    # -----------------------------
    missing_mask_2 = (
        df['vehicletype'].isna()
        & df['majority_type'].isna()
        & df['fallback_type'].isna()
    )

    group_cols_brand_model = ['model', 'brand']
    grouped_brand_model = df.groupby(group_cols_brand_model)['vehicletype']

    majority_brand_model = grouped_brand_model.apply(safe_mode)
    confidence_brand_model = grouped_brand_model.apply(
        lambda x: x.value_counts(normalize=True).iloc[0] if not x.dropna().empty else 0
    )

    majority_brand_model = majority_brand_model[confidence_brand_model >= threshold]
    majority_brand_model = majority_brand_model.rename('bm_type').reset_index()

    df = df.merge(majority_brand_model, on=group_cols_brand_model, how='left')

    # -----------------------------
    # STEP 4: Fill missing progressively
    # -----------------------------
    df['vehicletype'] = df['vehicletype'].fillna(df['majority_type'])
    df['vehicletype'] = df['vehicletype'].fillna(df['fallback_type'])
    df['vehicletype'] = df['vehicletype'].fillna(df['bm_type'])

    # -----------------------------
    # STEP 5: Cleanup
    # -----------------------------
    df.drop(columns=['majority_type', 'fallback_type', 'bm_type'], inplace=True)

    return df
In [102]:
df_vetype = fill_missing_vehicle_type(df_model_x)
df_vetype.isna().sum()
Out[102]:
datecrawled                    0
price                          0
vehicletype                21015
registrationyear               0
gearbox                    19830
power                          0
model                      14343
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
In [103]:
def fill_all_missing_values(
    df,
    threshold=0.9,
    verbose=True,
    repeat_until_no_change=True,
    max_loops=5
):
    """
    Runs all fill functions in sequence (and optionally repeats)
    until no more missing values are filled.

    Parameters
    ----------
    df : pd.DataFrame
        The input dataframe.
    threshold : float, optional (default=0.9)
        Confidence threshold for majority-based fills.
    verbose : bool, optional (default=True)
        Print progress updates.
    repeat_until_no_change : bool, optional (default=True)
        If True, keeps looping until no new values are filled.
    max_loops : int, optional (default=5)
        Safety limit for maximum number of full passes.

    Returns
    -------
    df : pd.DataFrame
        The filled dataframe.
    """

    df = df.copy()

    steps = [
        ("Vehicle Type", fill_missing_vehicle_type),
        ("Model", fill_missing_models_majority_x),
        ("Fuel Type", fill_missing_fueltype),
        ("Power (0 HP)", fill_zero_power)
    ]

    def count_missing(d):
        return (
            d['vehicletype'].isna().sum(),
            d['model'].isna().sum(),
            d['fueltype'].isna().sum(),
            (d['power'] == 0).sum()
        )

    last_missing = count_missing(df)

    loop = 0
    while True:
        loop += 1
        if verbose:
            print(f"\n🔁 Pass {loop} (threshold={threshold:.0%})")

        for name, func in steps:
            if verbose:
                print(f"   ▶ Running {name} fill function...")
            try:
                df = func(df, threshold=threshold)
            except TypeError:
                df = func(df)
            except Exception as e:
                print(f"   ⚠️ Error in {name}: {e}")

        current_missing = count_missing(df)

        if verbose:
            print(f"   Missing counts after pass {loop}:")
            print(f"      vehicletype: {current_missing[0]:,}")
            print(f"      model:       {current_missing[1]:,}")
            print(f"      fueltype:    {current_missing[2]:,}")
            print(f"      power==0:    {current_missing[3]:,}")

        # Stop if no more changes
        if not repeat_until_no_change:
            break

        if current_missing == last_missing:
            if verbose:
                print("\n✅ No further fills detected — stopping.")
            break

        if loop >= max_loops:
            if verbose:
                print("\n⚠️ Reached max loop limit, stopping.")
            break

        last_missing = current_missing

    if verbose:
        print("\n🏁 All fill functions completed.\n")

    return df
In [104]:
df_car = fill_all_missing_values(df_vetype, threshold=0.7, repeat_until_no_change=True)
🔁 Pass 1 (threshold=70%)
   ▶ Running Vehicle Type fill function...
   ▶ Running Model fill function...
✅ Filled 878 missing models (threshold=70%)
   ▶ Running Fuel Type fill function...
   ▶ Running Power (0 HP) fill function...
   Missing counts after pass 1:
      vehicletype: 16,445
      model:       13,465
      fueltype:    15,722
      power==0:    35,309

🔁 Pass 2 (threshold=70%)
   ▶ Running Vehicle Type fill function...
   ▶ Running Model fill function...
✅ Filled 17 missing models (threshold=70%)
   ▶ Running Fuel Type fill function...
   ▶ Running Power (0 HP) fill function...
   Missing counts after pass 2:
      vehicletype: 15,387
      model:       13,448
      fueltype:    15,260
      power==0:    35,289

🔁 Pass 3 (threshold=70%)
   ▶ Running Vehicle Type fill function...
   ▶ Running Model fill function...
✅ Filled 0 missing models (threshold=70%)
   ▶ Running Fuel Type fill function...
   ▶ Running Power (0 HP) fill function...
   Missing counts after pass 3:
      vehicletype: 15,387
      model:       13,448
      fueltype:    15,251
      power==0:    35,289

🔁 Pass 4 (threshold=70%)
   ▶ Running Vehicle Type fill function...
   ▶ Running Model fill function...
✅ Filled 0 missing models (threshold=70%)
   ▶ Running Fuel Type fill function...
   ▶ Running Power (0 HP) fill function...
   Missing counts after pass 4:
      vehicletype: 15,387
      model:       13,448
      fueltype:    15,251
      power==0:    35,289

✅ No further fills detected — stopping.

🏁 All fill functions completed.

In [105]:
def correct_registration_years(df, threshold=0.9, proximity=1):
    """
    Corrects registration years flagged as 'too early' or 'too late'.
    Adds ±proximity tolerance when determining majority years.
    """

    df = df.copy()

    # --- Split flagged vs correct ---
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged = df[flagged_mask].copy()
    correct = df[~flagged_mask].copy()

    if flagged.empty:
        return df  # nothing to fix

    # --- Helper: Cluster nearby years (±proximity) ---
    def cluster_years(series, proximity=1):
        if series.empty:
            return np.nan, 0
        years = series.dropna().astype(int)
        if years.empty:
            return np.nan, 0

        clusters = []
        for y in sorted(years.unique()):
            found = False
            for cluster in clusters:
                if abs(cluster['years'][-1] - y) <= proximity:
                    cluster['years'].append(y)
                    cluster['count'] += (years == y).sum()
                    found = True
                    break
            if not found:
                clusters.append({'years': [y], 'count': (years == y).sum()})

        top_cluster = max(clusters, key=lambda c: c['count'])
        cluster_year = int(np.round(np.mean(top_cluster['years'])))
        freq = top_cluster['count'] / len(years)
        return cluster_year, freq

    # --- Compute majority year per group ---
    def get_majority_table(group_cols):
        rows = []
        for name, group in correct.groupby(group_cols):
            year, freq = cluster_years(group['registrationyear'], proximity)
            rows.append((*name, year, freq, group['registrationyear'].min(), group['registrationyear'].max()))
        return pd.DataFrame(rows, columns=group_cols + ['majority_year','mode_freq','min','max'])

    # Start with detailed grouping
    majority_df = get_majority_table(['brand','model','year_bin','power','vehicletype'])

    flagged = flagged.merge(majority_df, on=['brand','model','year_bin','power','vehicletype'], how='left')

    # --- Fallback using (brand, model, vehicletype) ---
    missing_mask = flagged['majority_year'].isna()
    if missing_mask.any():
        fallback = get_majority_table(['brand','model','year_bin','vehicletype'])
        flagged = flagged.merge(
            fallback,
            on=['brand','model','year_bin','vehicletype'],
            how='left',
            suffixes=('','_fallback')
        )

        # Fill in missing majority fields from fallback where possible
        for col in ['majority_year','mode_freq','min','max']:
            flagged[col] = flagged[col].fillna(flagged[f"{col}_fallback"])

        # Clean up helper columns
        flagged.drop(columns=[c for c in flagged.columns if c.endswith('_fallback')], inplace=True)

    # --- Apply corrections ---
    def fill_year(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold and pd.notna(row['majority_year']):
            return row['majority_year']
        elif row['registration_correction'] == 'Y: too early' and pd.notna(row['min']):
            return row['min']
        elif row['registration_correction'] == 'Y: too late' and pd.notna(row['max']):
            return row['max']
        else:
            return row['registrationyear']

    def fill_flag(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold:
            return 'N'
        else:
            return row['registration_correction']

    flagged['registrationyear'] = flagged.apply(fill_year, axis=1)
    flagged['registration_correction'] = flagged.apply(fill_flag, axis=1)

    # --- Cleanup helper cols ---
    flagged.drop(columns=['majority_year','mode_freq','min','max'], inplace=True)

    # --- Combine back safely ---
    result = pd.concat([correct, flagged], ignore_index=True)
    return result
In [106]:
df_reg = correct_registration_years(df_car, threshold = 0.7, proximity = 5)
In [107]:
df_reg[(df_reg['registration_correction'] == "Y: too early") | (df_reg['registration_correction'] == "Y: too late")]
Out[107]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin
336004 11/03/2016 21:39 450 small 1910.0 NaN 0.0 ka 5000 0 petrol ford NaN 2016-11-03 0 24148 19/03/2016 08:46 Y: too early before_1990
336011 22/03/2016 14:55 3299 sedan 1989.0 auto 132.0 e_klasse 150000 6 petrol mercedes_benz no 2016-03-22 0 63801 06/04/2016 05:15 Y: too early before_1990
336013 16/03/2016 13:45 140 small 1986.0 NaN 0.0 cayenne 20000 0 petrol porsche NaN 2016-03-16 0 25860 17/03/2016 11:17 Y: too early before_1990
336014 29/03/2016 07:58 4300 coupe 1990.0 manual 170.0 90 150000 4 petrol audi NaN 2016-03-29 0 13595 05/04/2016 18:17 Y: too late 1990s
336015 28/03/2016 09:53 6990 wagon 1983.0 manual 72.0 e_klasse 150000 6 gasoline mercedes_benz no 2016-03-28 0 31737 06/04/2016 11:16 Y: too early before_1990
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
354102 16/03/2016 17:51 2300 NaN 2017.0 auto 192.0 NaN 150000 0 NaN bmw no 2016-03-16 0 45896 17/03/2016 16:17 Y: too late 2010_plus
354103 24/03/2016 16:54 900 NaN 2017.0 manual 101.0 NaN 150000 6 NaN opel NaN 2016-03-24 0 50170 07/04/2016 09:17 Y: too late 2010_plus
354104 07/04/2016 08:36 1670 NaN 2018.0 manual 0.0 NaN 90000 0 petrol renault no 2016-07-04 0 12167 07/04/2016 08:36 Y: too late 2010_plus
354105 04/04/2016 21:40 10980 NaN 2018.0 manual 75.0 NaN 20000 1 NaN volkswagen no 2016-04-04 0 44801 07/04/2016 00:15 Y: too late 2010_plus
354106 01/04/2016 02:36 1000 NaN 2017.0 manual 54.0 NaN 125000 2 NaN hyundai no 2016-01-04 0 67547 05/04/2016 02:45 Y: too late 2010_plus

7437 rows × 18 columns

In [108]:
df_reg_years = correct_registration_years(df_reg, threshold = 0.7, proximity = 10)
df_reg_years[(df_reg_years['registration_correction'] == "Y: too early") | (df_reg_years['registration_correction'] == "Y: too late")]
Out[108]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin
346670 11/03/2016 21:39 450 small 1910.0 NaN 0.0 ka 5000 0 petrol ford NaN 2016-11-03 0 24148 19/03/2016 08:46 Y: too early before_1990
346671 22/03/2016 14:55 3299 sedan 1989.0 auto 132.0 e_klasse 150000 6 petrol mercedes_benz no 2016-03-22 0 63801 06/04/2016 05:15 Y: too early before_1990
346672 16/03/2016 13:45 140 small 1986.0 NaN 0.0 cayenne 20000 0 petrol porsche NaN 2016-03-16 0 25860 17/03/2016 11:17 Y: too early before_1990
346673 29/03/2016 07:58 4300 coupe 1990.0 manual 170.0 90 150000 4 petrol audi NaN 2016-03-29 0 13595 05/04/2016 18:17 Y: too late 1990s
346674 28/03/2016 09:53 6990 wagon 1983.0 manual 72.0 e_klasse 150000 6 gasoline mercedes_benz no 2016-03-28 0 31737 06/04/2016 11:16 Y: too early before_1990
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
354102 16/03/2016 17:51 2300 NaN 2017.0 auto 192.0 NaN 150000 0 NaN bmw no 2016-03-16 0 45896 17/03/2016 16:17 Y: too late 2010_plus
354103 24/03/2016 16:54 900 NaN 2017.0 manual 101.0 NaN 150000 6 NaN opel NaN 2016-03-24 0 50170 07/04/2016 09:17 Y: too late 2010_plus
354104 07/04/2016 08:36 1670 NaN 2018.0 manual 0.0 NaN 90000 0 petrol renault no 2016-07-04 0 12167 07/04/2016 08:36 Y: too late 2010_plus
354105 04/04/2016 21:40 10980 NaN 2018.0 manual 75.0 NaN 20000 1 NaN volkswagen no 2016-04-04 0 44801 07/04/2016 00:15 Y: too late 2010_plus
354106 01/04/2016 02:36 1000 NaN 2017.0 manual 54.0 NaN 125000 2 NaN hyundai no 2016-01-04 0 67547 05/04/2016 02:45 Y: too late 2010_plus

7208 rows × 18 columns

In [109]:
def correct_registration_years_x(df, threshold=0.9, proximity=1):
    """
    Corrects registration years flagged as 'too early' or 'too late'.
    Adds ±proximity tolerance when determining majority years.
    """

    df = df.copy()

    # --- Split flagged vs correct ---
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged = df[flagged_mask].copy()
    correct = df[~flagged_mask].copy()

    if flagged.empty:
        return df  # nothing to fix

    # --- Helper: Cluster nearby years (±proximity) ---
    def cluster_years(series, proximity=1):
        if series.empty:
            return np.nan, 0
        years = series.dropna().astype(int)
        if years.empty:
            return np.nan, 0

        clusters = []
        for y in sorted(years.unique()):
            found = False
            for cluster in clusters:
                if abs(cluster['years'][-1] - y) <= proximity:
                    cluster['years'].append(y)
                    cluster['count'] += (years == y).sum()
                    found = True
                    break
            if not found:
                clusters.append({'years': [y], 'count': (years == y).sum()})

        top_cluster = max(clusters, key=lambda c: c['count'])
        cluster_year = int(np.round(np.mean(top_cluster['years'])))
        freq = top_cluster['count'] / len(years)
        return cluster_year, freq

    # --- Compute majority year per group ---
    def get_majority_table(group_cols):
        rows = []
        for name, group in correct.groupby(group_cols):
            year, freq = cluster_years(group['registrationyear'], proximity)
            rows.append((*name, year, freq, group['registrationyear'].min(), group['registrationyear'].max()))
        return pd.DataFrame(rows, columns=group_cols + ['majority_year','mode_freq','min','max'])

    # Start with detailed grouping
    majority_df = get_majority_table(['brand','model','power','vehicletype'])

    flagged = flagged.merge(majority_df, on=['brand','model','power','vehicletype'], how='left')

    # --- Fallback using (brand, model, vehicletype) ---
    missing_mask = flagged['majority_year'].isna()
    if missing_mask.any():
        fallback = get_majority_table(['brand','model','vehicletype'])
        flagged = flagged.merge(
            fallback,
            on=['brand','model','vehicletype'],
            how='left',
            suffixes=('','_fallback')
        )

        # Fill in missing majority fields from fallback where possible
        for col in ['majority_year','mode_freq','min','max']:
            flagged[col] = flagged[col].fillna(flagged[f"{col}_fallback"])

        # Clean up helper columns
        flagged.drop(columns=[c for c in flagged.columns if c.endswith('_fallback')], inplace=True)

    # --- Apply corrections ---
    def fill_year(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold and pd.notna(row['majority_year']):
            return row['majority_year']
        elif row['registration_correction'] == 'Y: too early' and pd.notna(row['min']):
            return row['min']
        elif row['registration_correction'] == 'Y: too late' and pd.notna(row['max']):
            return row['max']
        else:
            return row['registrationyear']

    def fill_flag(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold:
            return 'N'
        else:
            return row['registration_correction']

    flagged['registrationyear'] = flagged.apply(fill_year, axis=1)
    flagged['registration_correction'] = flagged.apply(fill_flag, axis=1)

    # --- Cleanup helper cols ---
    flagged.drop(columns=['majority_year','mode_freq','min','max'], inplace=True)

    # --- Combine back safely ---
    result = pd.concat([correct, flagged], ignore_index=True)
    return result
In [110]:
df_years = correct_registration_years_x(df_reg_years, threshold = 0.7, proximity = 1)
df_years[(df_years['registration_correction'] == "Y: too early") | (df_years['registration_correction'] == "Y: too late")]
Out[110]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin
346901 16/03/2016 13:45 140 small 1986.0 NaN 0.0 cayenne 20000 0 petrol porsche NaN 2016-03-16 0 25860 17/03/2016 11:17 Y: too early before_1990
346904 23/03/2016 11:52 6900 sedan 1996.0 manual 105.0 e_klasse 150000 6 petrol mercedes_benz no 2016-03-23 0 86609 05/04/2016 11:18 Y: too early before_1990
346905 27/03/2016 12:52 7900 sedan 1996.0 auto 194.0 e_klasse 125000 9 petrol mercedes_benz no 2016-03-27 0 80337 07/04/2016 08:17 Y: too early before_1990
346909 14/03/2016 09:50 999 coupe 2007.0 NaN 0.0 1er 150000 8 NaN bmw no 2016-03-14 0 76131 26/03/2016 02:46 Y: too early 1990s
346914 31/03/2016 17:37 1000 small 1994.0 manual 0.0 antara 70000 9 petrol opel no 2016-03-31 0 16775 06/04/2016 10:45 Y: too early 1990s
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
354102 16/03/2016 17:51 2300 NaN 2017.0 auto 192.0 NaN 150000 0 NaN bmw no 2016-03-16 0 45896 17/03/2016 16:17 Y: too late 2010_plus
354103 24/03/2016 16:54 900 NaN 2017.0 manual 101.0 NaN 150000 6 NaN opel NaN 2016-03-24 0 50170 07/04/2016 09:17 Y: too late 2010_plus
354104 07/04/2016 08:36 1670 NaN 2018.0 manual 0.0 NaN 90000 0 petrol renault no 2016-07-04 0 12167 07/04/2016 08:36 Y: too late 2010_plus
354105 04/04/2016 21:40 10980 NaN 2018.0 manual 75.0 NaN 20000 1 NaN volkswagen no 2016-04-04 0 44801 07/04/2016 00:15 Y: too late 2010_plus
354106 01/04/2016 02:36 1000 NaN 2017.0 manual 54.0 NaN 125000 2 NaN hyundai no 2016-01-04 0 67547 05/04/2016 02:45 Y: too late 2010_plus

5480 rows × 18 columns

In [111]:
del df_reg_years
In [112]:
df_years1 = correct_registration_years_x(df_years, threshold = 0.7, proximity = 3)
df_years1[(df_years1['registration_correction'] == "Y: too early") | (df_years1['registration_correction'] == "Y: too late")]
Out[112]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin
348627 16/03/2016 13:45 140 small 1986.0 NaN 0.0 cayenne 20000 0 petrol porsche NaN 2016-03-16 0 25860 17/03/2016 11:17 Y: too early before_1990
348628 23/03/2016 11:52 6900 sedan 1996.0 manual 105.0 e_klasse 150000 6 petrol mercedes_benz no 2016-03-23 0 86609 05/04/2016 11:18 Y: too early before_1990
348629 27/03/2016 12:52 7900 sedan 1996.0 auto 194.0 e_klasse 125000 9 petrol mercedes_benz no 2016-03-27 0 80337 07/04/2016 08:17 Y: too early before_1990
348631 31/03/2016 17:37 1000 small 1994.0 manual 0.0 antara 70000 9 petrol opel no 2016-03-31 0 16775 06/04/2016 10:45 Y: too early 1990s
348632 29/03/2016 11:53 3500 sedan 1998.0 auto 185.0 e_klasse 150000 0 petrol mercedes_benz no 2016-03-29 0 15328 05/04/2016 21:45 Y: too early before_1990
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
354102 16/03/2016 17:51 2300 NaN 2017.0 auto 192.0 NaN 150000 0 NaN bmw no 2016-03-16 0 45896 17/03/2016 16:17 Y: too late 2010_plus
354103 24/03/2016 16:54 900 NaN 2017.0 manual 101.0 NaN 150000 6 NaN opel NaN 2016-03-24 0 50170 07/04/2016 09:17 Y: too late 2010_plus
354104 07/04/2016 08:36 1670 NaN 2018.0 manual 0.0 NaN 90000 0 petrol renault no 2016-07-04 0 12167 07/04/2016 08:36 Y: too late 2010_plus
354105 04/04/2016 21:40 10980 NaN 2018.0 manual 75.0 NaN 20000 1 NaN volkswagen no 2016-04-04 0 44801 07/04/2016 00:15 Y: too late 2010_plus
354106 01/04/2016 02:36 1000 NaN 2017.0 manual 54.0 NaN 125000 2 NaN hyundai no 2016-01-04 0 67547 05/04/2016 02:45 Y: too late 2010_plus

5372 rows × 18 columns

In [113]:
del df_years
In [114]:
fix80 = (df_years1['registrationyear'] < 1990) & (df_years1['year_bin'] != 'before_1990')
df_years1.loc[fix80,['year_bin']] = 'before_1990'

fix90 = (df_years1['registrationyear'] > 1989) & (df_years1['registrationyear'] < 2000) & (df_years1['year_bin'] != '1990s')
df_years1.loc[fix90,['year_bin']] = '1990s'

fix00 = (df_years1['registrationyear'] > 1999) & (df_years1['registrationyear'] < 2010) & (df_years1['year_bin'] != '2000s')
df_years1.loc[fix00,['year_bin']] = '2000s'

fix10 = (df_years1['registrationyear'] > 2009) & (df_years1['year_bin'] != '2010_plus')
df_years1.loc[fix10,['year_bin']] = '2010_plus'
In [115]:
trabantfix = (df_years1['brand'] == 'trabant') & (df_years1['model'] == 'other') & (df_years1['registration_correction'] != 'N') & (df_years1['registrationyear'] == 1964)
df_years1.loc[trabantfix, ['registration_correction']] = 'N'

citroenfix = (df_years1['brand'] == 'citroen') & (df_years1['model'] == 'other') & (df_years1['registration_correction'] == 'Y: too early') & (df_years1['registrationyear'] == 1934)
df_years1.loc[citroenfix, ['registration_correction']] = 'N'
In [116]:
df_reg = correct_registration_years(df_years1, threshold = 0.7, proximity = 5)
df_reg[(df_reg['registration_correction'] == "Y: too early") | (df_reg['registration_correction'] == "Y: too late")]
Out[116]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin
348737 16/03/2016 13:45 140 small 1986.0 NaN 0.0 cayenne 20000 0 petrol porsche NaN 2016-03-16 0 25860 17/03/2016 11:17 Y: too early before_1990
348740 31/03/2016 17:37 1000 small 1994.0 manual 0.0 antara 70000 9 petrol opel no 2016-03-31 0 16775 06/04/2016 10:45 Y: too early 1990s
348742 21/03/2016 02:00 3500 small 1992.0 NaN 0.0 e_klasse 150000 1 NaN mercedes_benz no 2016-03-21 0 68799 06/04/2016 00:44 Y: too early 1990s
348743 20/03/2016 19:27 12500 suv 2005.0 auto 296.0 range_rover_evoque 150000 8 gasoline land_rover no 2016-03-20 0 61462 20/03/2016 20:38 Y: too early 2000s
348745 10/03/2016 23:44 3490 wagon 2007.0 manual 101.0 calibra 150000 12 gasoline opel no 2016-10-03 0 66953 11/03/2016 12:17 Y: too late 2000s
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
354102 16/03/2016 17:51 2300 NaN 2017.0 auto 192.0 NaN 150000 0 NaN bmw no 2016-03-16 0 45896 17/03/2016 16:17 Y: too late 2010_plus
354103 24/03/2016 16:54 900 NaN 2017.0 manual 101.0 NaN 150000 6 NaN opel NaN 2016-03-24 0 50170 07/04/2016 09:17 Y: too late 2010_plus
354104 07/04/2016 08:36 1670 NaN 2018.0 manual 0.0 NaN 90000 0 petrol renault no 2016-07-04 0 12167 07/04/2016 08:36 Y: too late 2010_plus
354105 04/04/2016 21:40 10980 NaN 2018.0 manual 75.0 NaN 20000 1 NaN volkswagen no 2016-04-04 0 44801 07/04/2016 00:15 Y: too late 2010_plus
354106 01/04/2016 02:36 1000 NaN 2017.0 manual 54.0 NaN 125000 2 NaN hyundai no 2016-01-04 0 67547 05/04/2016 02:45 Y: too late 2010_plus

5327 rows × 18 columns

In [117]:
del df_years1
In [118]:
df_app = fill_all_missing_values(df_reg, threshold=0.7, repeat_until_no_change=True)
🔁 Pass 1 (threshold=70%)
   ▶ Running Vehicle Type fill function...
   ▶ Running Model fill function...
✅ Filled 57 missing models (threshold=70%)
   ▶ Running Fuel Type fill function...
   ▶ Running Power (0 HP) fill function...
   Missing counts after pass 1:
      vehicletype: 15,376
      model:       13,391
      fueltype:    14,451
      power==0:    34,829

🔁 Pass 2 (threshold=70%)
   ▶ Running Vehicle Type fill function...
   ▶ Running Model fill function...
✅ Filled 7 missing models (threshold=70%)
   ▶ Running Fuel Type fill function...
   ▶ Running Power (0 HP) fill function...
   Missing counts after pass 2:
      vehicletype: 15,346
      model:       13,384
      fueltype:    14,379
      power==0:    34,827

🔁 Pass 3 (threshold=70%)
   ▶ Running Vehicle Type fill function...
   ▶ Running Model fill function...
✅ Filled 2 missing models (threshold=70%)
   ▶ Running Fuel Type fill function...
   ▶ Running Power (0 HP) fill function...
   Missing counts after pass 3:
      vehicletype: 15,346
      model:       13,382
      fueltype:    14,371
      power==0:    34,827

🔁 Pass 4 (threshold=70%)
   ▶ Running Vehicle Type fill function...
   ▶ Running Model fill function...
✅ Filled 0 missing models (threshold=70%)
   ▶ Running Fuel Type fill function...
   ▶ Running Power (0 HP) fill function...
   Missing counts after pass 4:
      vehicletype: 15,346
      model:       13,382
      fueltype:    14,371
      power==0:    34,827

✅ No further fills detected — stopping.

🏁 All fill functions completed.

In [119]:
df_reg = correct_registration_years(df_app, threshold = 0.7, proximity = 10)
df_reg[(df_reg['registration_correction'] == "Y: too early") | (df_reg['registration_correction'] == "Y: too late")]
Out[119]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin
348780 16/03/2016 13:45 140 small 1986.0 NaN 0.0 cayenne 20000 0 petrol porsche NaN 2016-03-16 0 25860 17/03/2016 11:17 Y: too early before_1990
348781 31/03/2016 17:37 1000 small 1994.0 manual 0.0 antara 70000 9 petrol opel no 2016-03-31 0 16775 06/04/2016 10:45 Y: too early 1990s
348782 21/03/2016 02:00 3500 small 1992.0 NaN 0.0 e_klasse 150000 1 NaN mercedes_benz no 2016-03-21 0 68799 06/04/2016 00:44 Y: too early 1990s
348783 20/03/2016 19:27 12500 suv 2005.0 auto 296.0 range_rover_evoque 150000 8 gasoline land_rover no 2016-03-20 0 61462 20/03/2016 20:38 Y: too early 2000s
348784 10/03/2016 23:44 3490 wagon 2007.0 manual 101.0 calibra 150000 12 gasoline opel no 2016-10-03 0 66953 11/03/2016 12:17 Y: too late 2000s
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
354102 16/03/2016 17:51 2300 NaN 2017.0 auto 192.0 NaN 150000 0 NaN bmw no 2016-03-16 0 45896 17/03/2016 16:17 Y: too late 2010_plus
354103 24/03/2016 16:54 900 NaN 2017.0 manual 101.0 NaN 150000 6 NaN opel NaN 2016-03-24 0 50170 07/04/2016 09:17 Y: too late 2010_plus
354104 07/04/2016 08:36 1670 NaN 2018.0 manual 0.0 NaN 90000 0 petrol renault no 2016-07-04 0 12167 07/04/2016 08:36 Y: too late 2010_plus
354105 04/04/2016 21:40 10980 NaN 2018.0 manual 75.0 NaN 20000 1 NaN volkswagen no 2016-04-04 0 44801 07/04/2016 00:15 Y: too late 2010_plus
354106 01/04/2016 02:36 1000 NaN 2017.0 manual 54.0 NaN 125000 2 NaN hyundai no 2016-01-04 0 67547 05/04/2016 02:45 Y: too late 2010_plus

5324 rows × 18 columns

In [120]:
del df_app
In [121]:
def correct_registration_years1(df, threshold=0.9, proximity=1):
    """
    Corrects registration years flagged as 'too early' or 'too late'.
    Adds ±proximity tolerance when determining majority years.
    """

    df = df.copy()

    # --- Split flagged vs correct ---
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged = df[flagged_mask].copy()
    correct = df[~flagged_mask].copy()

    if flagged.empty:
        return df  # nothing to fix

    # --- Helper: Cluster nearby years (±proximity) ---
    def cluster_years(series, proximity=1):
        if series.empty:
            return np.nan, 0
        years = series.dropna().astype(int)
        if years.empty:
            return np.nan, 0

        clusters = []
        for y in sorted(years.unique()):
            found = False
            for cluster in clusters:
                if abs(cluster['years'][-1] - y) <= proximity:
                    cluster['years'].append(y)
                    cluster['count'] += (years == y).sum()
                    found = True
                    break
            if not found:
                clusters.append({'years': [y], 'count': (years == y).sum()})

        top_cluster = max(clusters, key=lambda c: c['count'])
        cluster_year = int(np.round(np.mean(top_cluster['years'])))
        freq = top_cluster['count'] / len(years)
        return cluster_year, freq

    # --- Compute majority year per group ---
    def get_majority_table(group_cols):
        rows = []
        for name, group in correct.groupby(group_cols):
            year, freq = cluster_years(group['registrationyear'], proximity)
            rows.append((*name, year, freq, group['registrationyear'].min(), group['registrationyear'].max()))
        return pd.DataFrame(rows, columns=group_cols + ['majority_year','mode_freq','min','max'])

    # Start with detailed grouping
    majority_df = get_majority_table(['brand','model','year_bin','power'])

    flagged = flagged.merge(majority_df, on=['brand','model','year_bin','power'], how='left')

    # --- Fallback using (brand, model, yearbin) ---
    missing_mask = flagged['majority_year'].isna()
    if missing_mask.any():
        fallback = get_majority_table(['brand','model','year_bin'])
        flagged = flagged.merge(
            fallback,
            on=['brand','model','year_bin',],
            how='left',
            suffixes=('','_fallback')
        )

        # Fill in missing majority fields from fallback where possible
        for col in ['majority_year','mode_freq','min','max']:
            flagged[col] = flagged[col].fillna(flagged[f"{col}_fallback"])

        # Clean up helper columns
        flagged.drop(columns=[c for c in flagged.columns if c.endswith('_fallback')], inplace=True)

    # --- Apply corrections ---
    def fill_year(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold and pd.notna(row['majority_year']):
            return row['majority_year']
        elif row['registration_correction'] == 'Y: too early' and pd.notna(row['min']):
            return row['min']
        elif row['registration_correction'] == 'Y: too late' and pd.notna(row['max']):
            return row['max']
        else:
            return row['registrationyear']

    def fill_flag(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold:
            return 'N'
        else:
            return row['registration_correction']

    flagged['registrationyear'] = flagged.apply(fill_year, axis=1)
    flagged['registration_correction'] = flagged.apply(fill_flag, axis=1)

    # --- Cleanup helper cols ---
    flagged.drop(columns=['majority_year','mode_freq','min','max'], inplace=True)

    # --- Combine back safely ---
    result = pd.concat([correct, flagged], ignore_index=True)
    return result
In [122]:
df_reg1 = correct_registration_years1(df_reg, threshold = 0.7, proximity = 5)
df_reg1[(df_reg1['registration_correction'] == "Y: too early") | (df_reg1['registration_correction'] == "Y: too late")]
Out[122]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin
348783 16/03/2016 13:45 140 small 1986.0 NaN 0.0 cayenne 20000 0 petrol porsche NaN 2016-03-16 0 25860 17/03/2016 11:17 Y: too early before_1990
348784 31/03/2016 17:37 1000 small 1994.0 manual 0.0 antara 70000 9 petrol opel no 2016-03-31 0 16775 06/04/2016 10:45 Y: too early 1990s
348786 20/03/2016 19:27 12500 suv 2005.0 auto 296.0 range_rover_evoque 150000 8 gasoline land_rover no 2016-03-20 0 61462 20/03/2016 20:38 Y: too early 2000s
348787 10/03/2016 23:44 3490 wagon 2007.0 manual 101.0 calibra 150000 12 gasoline opel no 2016-10-03 0 66953 11/03/2016 12:17 Y: too late 2000s
348788 21/03/2016 12:51 1400 coupe 1999.0 auto 196.0 glk 150000 7 petrol mercedes_benz no 2016-03-21 0 47441 21/03/2016 12:51 Y: too early 1990s
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
354102 16/03/2016 17:51 2300 NaN 2017.0 auto 192.0 NaN 150000 0 NaN bmw no 2016-03-16 0 45896 17/03/2016 16:17 Y: too late 2010_plus
354103 24/03/2016 16:54 900 NaN 2017.0 manual 101.0 NaN 150000 6 NaN opel NaN 2016-03-24 0 50170 07/04/2016 09:17 Y: too late 2010_plus
354104 07/04/2016 08:36 1670 NaN 2018.0 manual 0.0 NaN 90000 0 petrol renault no 2016-07-04 0 12167 07/04/2016 08:36 Y: too late 2010_plus
354105 04/04/2016 21:40 10980 NaN 2018.0 manual 75.0 NaN 20000 1 NaN volkswagen no 2016-04-04 0 44801 07/04/2016 00:15 Y: too late 2010_plus
354106 01/04/2016 02:36 1000 NaN 2017.0 manual 54.0 NaN 125000 2 NaN hyundai no 2016-01-04 0 67547 05/04/2016 02:45 Y: too late 2010_plus

2524 rows × 18 columns

In [123]:
def correct_registration_years2(df, threshold=0.9, proximity=1):
    """
    Corrects registration years flagged as 'too early' or 'too late'.
    Adds ±proximity tolerance when determining majority years.
    """

    df = df.copy()

    # --- Split flagged vs correct ---
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged = df[flagged_mask].copy()
    correct = df[~flagged_mask].copy()

    if flagged.empty:
        return df  # nothing to fix

    # --- Helper: Cluster nearby years (±proximity) ---
    def cluster_years(series, proximity=1):
        if series.empty:
            return np.nan, 0
        years = series.dropna().astype(int)
        if years.empty:
            return np.nan, 0

        clusters = []
        for y in sorted(years.unique()):
            found = False
            for cluster in clusters:
                if abs(cluster['years'][-1] - y) <= proximity:
                    cluster['years'].append(y)
                    cluster['count'] += (years == y).sum()
                    found = True
                    break
            if not found:
                clusters.append({'years': [y], 'count': (years == y).sum()})

        top_cluster = max(clusters, key=lambda c: c['count'])
        cluster_year = int(np.round(np.mean(top_cluster['years'])))
        freq = top_cluster['count'] / len(years)
        return cluster_year, freq

    # --- Compute majority year per group ---
    def get_majority_table(group_cols):
        rows = []
        for name, group in correct.groupby(group_cols):
            year, freq = cluster_years(group['registrationyear'], proximity)
            rows.append((*name, year, freq, group['registrationyear'].min(), group['registrationyear'].max()))
        return pd.DataFrame(rows, columns=group_cols + ['majority_year','mode_freq','min','max'])

    # Start with detailed grouping
    majority_df = get_majority_table(['brand','model','year_bin'])

    flagged = flagged.merge(majority_df, on=['brand','model','year_bin'], how='left')

    # --- Fallback using (brand, yearbin) ---
    missing_mask = flagged['majority_year'].isna()
    if missing_mask.any():
        fallback = get_majority_table(['brand','year_bin'])
        flagged = flagged.merge(
            fallback,
            on=['brand','year_bin',],
            how='left',
            suffixes=('','_fallback')
        )

        # Fill in missing majority fields from fallback where possible
        for col in ['majority_year','mode_freq','min','max']:
            flagged[col] = flagged[col].fillna(flagged[f"{col}_fallback"])

        # Clean up helper columns
        flagged.drop(columns=[c for c in flagged.columns if c.endswith('_fallback')], inplace=True)

    # --- Apply corrections ---
    def fill_year(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold and pd.notna(row['majority_year']):
            return row['majority_year']
        elif row['registration_correction'] == 'Y: too early' and pd.notna(row['min']):
            return row['min']
        elif row['registration_correction'] == 'Y: too late' and pd.notna(row['max']):
            return row['max']
        else:
            return row['registrationyear']

    def fill_flag(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold:
            return 'N'
        else:
            return row['registration_correction']

    flagged['registrationyear'] = flagged.apply(fill_year, axis=1)
    flagged['registration_correction'] = flagged.apply(fill_flag, axis=1)

    # --- Cleanup helper cols ---
    flagged.drop(columns=['majority_year','mode_freq','min','max'], inplace=True)

    # --- Combine back safely ---
    result = pd.concat([correct, flagged], ignore_index=True)
    return result
In [124]:
df_reg2 = correct_registration_years2(df_reg1, threshold = 0.7, proximity = 5)
df_reg2[(df_reg2['registration_correction'] == "Y: too early") | (df_reg2['registration_correction'] == "Y: too late")].head()
Out[124]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin
351589 04/04/2016 20:57 14500 suv 2015.0 manual 26.0 601 40000 4 petrol trabant no 2016-04-04 0 98704 06/04/2016 23:44 Y: too late 2010_plus
351633 27/03/2016 13:46 2300 suv 2017.0 manual 26.0 601 70000 1 other trabant no 2016-03-27 0 39443 07/04/2016 09:45 Y: too late 2010_plus
351659 26/03/2016 13:46 2190 suv 2017.0 manual 0.0 601 50000 1 petrol trabant NaN 2016-03-26 0 98617 06/04/2016 01:44 Y: too late 2010_plus
351717 21/03/2016 20:56 1900 suv 2016.0 NaN 26.0 601 30000 6 petrol trabant NaN 2016-03-21 0 16259 07/04/2016 00:44 Y: too late 2010_plus
351749 22/03/2016 18:46 150 NaN 2015.0 NaN 0.0 other 80000 0 NaN trabant NaN 2016-03-22 0 39340 22/03/2016 18:46 Y: too late 2010_plus
In [125]:
del df_reg1
In [126]:
df_reg2[(df_reg2['model'] == '601') & (df_reg2['registration_correction'] == "N")].median()
Out[126]:
price                  900.0
registrationyear      1987.0
power                   26.0
model                  601.0
mileage              50000.0
registrationmonth        2.0
numberofpictures         0.0
postalcode           16759.5
dtype: float64
In [127]:
tra601 = (df_reg2['model'] == '601') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[tra601,['registrationyear']] = 1987

df_reg2.loc[tra601,['registration_correction']] = "N"
In [128]:
df_reg2[(df_reg2['brand'] == 'trabant') & (df_reg2['registration_correction'] == "N")].median()
Out[128]:
price                  945.0
registrationyear      1987.0
power                   26.0
mileage              50000.0
registrationmonth        1.0
numberofpictures         0.0
postalcode           16547.0
dtype: float64
In [129]:
tratra = (df_reg2['brand'] == 'trabant') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[tratra,['registrationyear']] = 1987

df_reg2.loc[tratra,['registration_correction']] = "N"
In [130]:
df_reg2[(df_reg2['brand'] == 'rover') & (df_reg2['registration_correction'] == "N")].median()
Out[130]:
price                   949.0
registrationyear       1999.0
power                   103.0
mileage              150000.0
registrationmonth         6.0
numberofpictures          0.0
postalcode            45881.0
dtype: float64
In [131]:
rover = (df_reg2['brand'] == 'rover') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[rover,['registrationyear']] = 1999
df_reg2.loc[rover,['registration_correction']] = "N"
In [132]:
df_reg2[(df_reg2['brand'] == 'hyundai') & (df_reg2['registration_correction'] == "N")].median()
Out[132]:
price                  3850.0
registrationyear       2007.0
power                   102.0
mileage              125000.0
registrationmonth         6.0
numberofpictures          0.0
postalcode            49586.0
dtype: float64
In [133]:
hyun = (df_reg2['brand'] == 'hyundai') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[hyun,['registrationyear']] = 2007
df_reg2.loc[hyun,['registration_correction']] = "N"
In [134]:
df_reg2[(df_reg2['brand'] == 'audi') & (df_reg2['model'] == 'q7') & (df_reg2['registration_correction'] == "N")].median()
Out[134]:
price                 15500.0
registrationyear       2007.0
power                   233.0
mileage              150000.0
registrationmonth         8.0
numberofpictures          0.0
postalcode            46485.0
dtype: float64
In [135]:
aq7 = (df_reg2['brand'] == 'audi') & (df_reg2['model'] == 'q7') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[aq7,['registrationyear']] = 2007 
df_reg2.loc[aq7,['registration_correction']] = "N"
In [136]:
fix80 = (df_reg2['registrationyear'] < 1990) & (df_reg2['year_bin'] != 'before_1990')
df_reg2.loc[fix80,['year_bin']] = 'before_1990'

fix90 = (df_reg2['registrationyear'] > 1989) & (df_reg2['registrationyear'] < 2000) & (df_reg2['year_bin'] != '1990s')
df_reg2.loc[fix90,['year_bin']] = '1990s'

fix00 = (df_reg2['registrationyear'] > 1999) & (df_reg2['registrationyear'] < 2010) & (df_reg2['year_bin'] != '2000s')
df_reg2.loc[fix00,['year_bin']] = '2000s'

fix10 = (df_reg2['registrationyear'] > 2009) & (df_reg2['year_bin'] != '2010_plus')
df_reg2.loc[fix10,['year_bin']] = '2010_plus'
In [137]:
df_app3 = fill_missing_vehicle_type(df_reg2, threshold = 0.7)
df_app3.isna().sum()
Out[137]:
datecrawled                    0
price                          0
vehicletype                15345
registrationyear               0
gearbox                    19830
power                          0
model                      13382
mileage                        0
registrationmonth              0
fueltype                   14371
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
In [138]:
del df_reg2
In [139]:
too_high_hp = (df_app3['power'] > 999)
df_app3.loc[too_high_hp,['power']] = 0

hp_toohigh = (df_app3['power'] > 621) & (df_app3['model'] != 'other') & (df_app3['model'] != '5er')
df_app3.loc[hp_toohigh,['power']] = 0

hp_high = (df_app3['power'] > 450) & (~(df_app3['brand'].isin(['mercedes_benz','audi','bmw','porsche','ford']))) & (df_app3['model'] != 'other')
df_app3.loc[hp_high,['power']] = 0

vwgolfhigh = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'golf') & (df_app3['power'] > 306)
df_app3.loc[vwgolfhigh,['power']] = 0

polohighe = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'polo') & (df_app3['power'] > 200)
df_app3.loc[polohighe,['power']] = 0

passathigh = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'passat') & (df_app3['power'] > 300)
df_app3.loc[passathigh,['power']] = 0

jagxhigh = (df_app3['brand'] == 'jaguar') & (df_app3['model'] == 'x_type') & (df_app3['power'] > 240)
df_app3.loc[jagxhigh,['power']] = 0

captivahigh = (df_app3['brand'] == 'chevrolet') & (df_app3['model'] == 'captiva') & (df_app3['power'] > 258)
df_app3.loc[captivahigh,['power']] = 0

vwhigh = (df_app3['brand'] == 'volkswagen') & (df_app3['power'] > 420)
df_app3.loc[vwhigh,['power']] = 0

citroenhigh = (df_app3['brand'] == 'citroen') & (df_app3['power'] > 241)
df_app3.loc[citroenhigh,['power']] = 0

chryslerhigh = (df_app3['brand'] == 'chrysler') & (df_app3['power'] > 470)
df_app3.loc[chryslerhigh,['power']] = 0

fiathigh = (df_app3['brand'] == 'fiat') & (df_app3['power'] > 220)
df_app3.loc[fiathigh,['power']] = 0

suzukihigh = (df_app3['brand'] == 'suzuki') & (df_app3['power'] > 290)
df_app3.loc[suzukihigh,['power']] = 0

arhigh = (df_app3['brand'] == 'alfa_romeo') & (df_app3['power'] > 505)
df_app3.loc[arhigh,['power']] = 0

fordhigh = (df_app3['brand'] == 'ford') & (df_app3['power'] > 760)
df_app3.loc[fordhigh,['power']] = 0

chevyhigh = (df_app3['brand'] == 'chevrolet') & (df_app3['power'] > 650)
df_app3.loc[chevyhigh,['power']] = 0

hyundaihigh = (df_app3['brand'] == 'hyundai') & (df_app3['power'] > 370)
df_app3.loc[hyundaihigh,['power']] = 0

mitsubishihigh = (df_app3['brand'] == 'mitsubishi') & (df_app3['power'] > 440)
df_app3.loc[mitsubishihigh,['power']] = 0

nissanhigh =  (df_app3['brand'] == 'nissan') & (df_app3['power'] > 600)
df_app3.loc[nissanhigh,['power']] = 0

opelhigh = (df_app3['brand'] == 'opel') & (df_app3['power'] > 577)
df_app3.loc[opelhigh,['power']] = 0

pehigh = (df_app3['brand'] == 'peugeot') & (df_app3['power'] > 360)
df_app3.loc[pehigh,['power']] = 0

seathigh = (df_app3['brand'] == 'seat') & (df_app3['power'] > 340)
df_app3.loc[seathigh,['power']] = 0

volvohigh = (df_app3['brand'] == 'volvo') & (df_app3['power'] > 510)
df_app3.loc[volvohigh,['power']] = 0

smarthigh = (df_app3['brand'] == 'smart') & (df_app3['power'] > 422)
df_app3.loc[smarthigh,['power']] = 0
In [140]:
del too_high_hp 
del hp_toohigh 
del hp_high 
del vwgolfhigh 
del polohighe 
del passathigh
del jagxhigh
del captivahigh 
del vwhigh
del citroenhigh 
del chryslerhigh
del fiathigh 
del suzukihigh 
del arhigh 
del fordhigh
del chevyhigh 
del hyundaihigh 
del mitsubishihigh 
del nissanhigh 
del opelhigh 
del pehigh 
del seathigh 
del volvohigh
del smarthigh
gc.collect()
Out[140]:
0
In [141]:
too_low = (df_app3['power']>0) & (df_app3['power']<5)
df_app3.loc[too_low,['power']] = 0

bmw = (df_app3['brand'] == 'bmw') & (df_app3['model'] == 'bmw')
df_app3.loc[bmw,['model']] = None

opellow = (df_app3['brand'] == 'opel') & (df_app3['power'] > 0) & (df_app3['power']<40) & (df_app3['model'] != 'other')
df_app3.loc[opellow,['power']] = 0

vwlow = (df_app3['brand'] == 'volkswagen') & (df_app3['power']>0) & (df_app3['power']<30) & (df_app3['model'] != 'other')
df_app3.loc[vwlow,['power']] = 0

citroenlow = (df_app3['brand'] == 'citroen') & (df_app3['power']>0) & (df_app3['power'] < 32) & (df_app3['model'] != 'other')
df_app3.loc[citroenlow,['power']] = 0

fordlow = (df_app3['brand'] == 'ford') & (df_app3['power']>0) & (df_app3['power'] < 45) & (df_app3['model'] != 'other')
df_app3.loc[fordlow,['power']] = 0

renaultlow = (df_app3['brand'] == 'renault') & (df_app3['power']>0) & (df_app3['power'] < 32) & (df_app3['model'] != 'other')
df_app3.loc[renaultlow,['power']] = 0

merclow = (df_app3['brand'] == 'mercedes_benz') & (df_app3['power']>0) & (df_app3['power'] < 55) & (df_app3['model'] != 'other')
df_app3.loc[merclow,['power']] = 0

bmwlow = (df_app3['brand'] == 'bmw') & (df_app3['power']>0) & (df_app3['power'] < 55) & (df_app3['model'] != 'other')
df_app3.loc[bmwlow,['power']] = 0

audilow = (df_app3['brand'] == 'audi') & (df_app3['power']>0) & (df_app3['power'] < 44) & (df_app3['model'] != 'other')
df_app3.loc[audilow,['power']] = 0

fiatlow = (df_app3['brand'] == 'fiat') & (df_app3['power']>0) & (df_app3['power'] < 13) & (df_app3['model'] != 'other')
df_app3.loc[fiatlow,['power']] = 0

pelow = (df_app3['brand'] == 'peugeot') & (df_app3['power']>0) & (df_app3['power'] < 34) & (df_app3['model'] != 'other')
df_app3.loc[pelow,['power']] = 0

trabantlow = (df_app3['brand'] == 'trabant') & (df_app3['power']>0) & (df_app3['power'] < 23) & (df_app3['model'] != 'other')
df_app3.loc[trabantlow,['power']] = 0

nislow = (df_app3['brand'] == 'nissan') & (df_app3['power']>0) & (df_app3['power'] < 45) & (df_app3['model'] != 'other')
df_app3.loc[nislow,['power']] = 0

sk45 = (df_app3['brand'].isin(['mazda','smart','seat','skoda','mitsubishi','toyota','volvo','honda','suzuki'])) & (df_app3['power']>0) & (df_app3['power'] < 45) & (df_app3['model'] != 'other')
df_app3.loc[sk45,['power']] = 0

hylow = (df_app3['brand'].isin(['hyundai'])) & (df_app3['power']>0) & (df_app3['power'] < 49) & (df_app3['model'] != 'other')
df_app3.loc[hylow,['power']] = 0

subarulow = (df_app3['brand'].isin(['subaru'])) & (df_app3['power']>0) & (df_app3['power'] < 54) & (df_app3['model'] != 'other')
df_app3.loc[subarulow,['power']] = 0

dacialow = (df_app3['brand'].isin(['dacia'])) & (df_app3['power']>0) & (df_app3['power'] < 67) & (df_app3['model'] != 'other')
df_app3.loc[dacialow,['power']] = 0

k55 = (df_app3['brand'].isin(['rover','kia','lancia'])) & (df_app3['power']>0) & (df_app3['power'] < 55) & (df_app3['model'] != 'other')
df_app3.loc[k55,['power']] = 0

lrlow = (df_app3['brand'].isin(['land_rover'])) & (df_app3['power']>0) & (df_app3['power'] < 50) & (df_app3['model'] != 'other')
df_app3.loc[k55,['power']] = 0

fiat500low = (df_app3['brand'] == 'fiat') & (df_app3['model'] == '500') & (df_app3['registrationyear'] > 1975) & (df_app3['power']>0) & (df_app3['power']< 69) & (df_app3['model'] != 'other') & (df_app3['brand'] != 'sonstige_autos')
df_app3.loc[fiat500low,['power']] = 0

freelanderlow = (df_app3['brand'] == 'land_rover') & (df_app3['model'] == 'freelander') & (df_app3['power']>0) & (df_app3['power']< 109)
df_app3.loc[freelanderlow,['power']] = 0

pandalow = (df_app3['brand'] == 'fiat') & (df_app3['model'] == 'panda') & (df_app3['power'] > 0) & (df_app3['power']<30)
df_app3.loc[pandalow,['power']] = 0

seilow = (df_app3['brand'] == 'fiat') & (df_app3['model'] == 'seicento') & (df_app3['power'] > 0) & (df_app3['power']<39)
df_app3.loc[seilow,['power']] = 0

stilow = (df_app3['brand'] == 'fiat') & (df_app3['model'] == 'stilo') & (df_app3['power'] > 0) & (df_app3['power']<59)
df_app3.loc[stilow,['power']] = 0

beetle03 = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'beetle') & (df_app3['registrationyear'] >2002) & (df_app3['power'] > 0) & (df_app3['power']<75)
df_app3.loc[beetle03,['power']] = 0

polow = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'polo') & (df_app3['power']>0) & (df_app3['power'] < 37)
df_app3.loc[polow,['power']] = 0

luplow = (df_app3['model'] == 'lupo') & (df_app3['power']>0) & (df_app3['power'] < 45)
df_app3.loc[luplow,['power']] = 0

golflow = (df_app3['model'] == 'golf') & (df_app3['power']>0) & (df_app3['power'] < 50)
df_app3.loc[golflow,['power']] = 0

movlow = (df_app3['model'] == 'move') & (df_app3['power']>0) & (df_app3['power'] < 40)
df_app3.loc[movlow,['power']] = 0

sharanlow = (df_app3['model'] == 'sharan') & (df_app3['power']>0) & (df_app3['power'] < 90)
df_app3.loc[sharanlow,['power']] = 0

twinlow = (df_app3['model'] == 'twingo') & (df_app3['power']>0) & (df_app3['power'] < 40)
In [142]:
del too_low 
del bmw 
del opellow 
del vwlow 
del citroenlow 
del fordlow 
del renaultlow 
del merclow 
del bmwlow 
del audilow 
del fiatlow 
del pelow 
del trabantlow 
del nislow 
del sk45 
del hylow 
del subarulow 
del dacialow 
del k55
del lrlow 
del fiat500low
del freelanderlow
del pandalow 
del seilow 
del stilow 
del beetle03 
del polow 
del luplow 
del golflow 
del movlow 
del sharanlow 
del twinlow 
gc.collect()
Out[142]:
0
In [143]:
def fill_gearbox(df, threshold=0.9, verbose=True):
    df = df.copy()
    df['gearbox'] = df['gearbox'].str.lower().str.strip()
    
    fill_strategies = [
        ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
        ['brand', 'model', 'fueltype', 'vehicletype'],
        ['brand', 'model', 'fueltype'],
        ['brand', 'model'],
        ['brand']
    ]
    
    total_filled = 0
    start_missing = df['gearbox'].isna().sum()

    for cols in fill_strategies:
        # Count how many "auto" and "manual" in each group
        group_counts = (
            df.dropna(subset=['gearbox'])
            .groupby(cols)['gearbox']
            .value_counts(normalize=True)
            .rename('ratio')
            .reset_index()
        )
        
        # Keep only those where ratio >= threshold
        group_confident = (
            group_counts[group_counts['ratio'] >= threshold]
            .drop_duplicates(subset=cols)
            .rename(columns={'gearbox': 'fill_value'})
            .drop(columns=['ratio'])
        )

        if group_confident.empty:
            continue
        
        df = df.merge(group_confident, on=cols, how='left', suffixes=('', '_fill'))

        mask = df['gearbox'].isna() & df['fill_value'].notna()
        filled_now = mask.sum()

        df.loc[mask, 'gearbox'] = df.loc[mask, 'fill_value']
        df.drop(columns='fill_value', inplace=True)

        total_filled += filled_now
        if verbose and filled_now > 0:
            print(f"Filled {filled_now} missing gearbox values using {cols} (≥{threshold*100:.0f}% majority rule)")

        if df['gearbox'].isna().sum() == 0:
            break

    if verbose:
        end_missing = df['gearbox'].isna().sum()
        print(f"\n✅ Gearbox filling complete: {start_missing - end_missing} filled, {end_missing} still missing.")

    return df
In [144]:
df_app3g = fill_gearbox(df_app3)
Filled 7629 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥90% majority rule)
Filled 1128 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥90% majority rule)
Filled 438 missing gearbox values using ['brand', 'model', 'fueltype'] (≥90% majority rule)
Filled 1612 missing gearbox values using ['brand', 'model'] (≥90% majority rule)
Filled 1678 missing gearbox values using ['brand'] (≥90% majority rule)

✅ Gearbox filling complete: 12485 filled, 7345 still missing.
In [145]:
df_app3g.to_pickle('checkpoint_01.pkl')
In [146]:
cvt = (df_app3g['model'].isin(['corsa'])) & (df_app3g['vehicletype'].isin(['wagon','coupe','bus']))
df_app3g.loc[cvt,'vehicletype'] = np.nan

gbus = (df_app3g['model'].isin(['golf'])) & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[gbus,['vehicletype']] = np.nan

puv = (df_app3g['model'].isin(['polo'])) & (df_app3g['vehicletype'].isin(['bus', 'suv']))
df_app3g.loc[puv,['vehicletype']] = np.nan

bmwsuv = (df_app3g['model'].isin(['3er'])) & (df_app3g['vehicletype'].isin(['bus', 'suv']))
df_app3g.loc[bmwsuv,['vehicletype']] = np.nan

astrabus = (df_app3g['model'].isin(['astra'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[astrabus,['vehicletype']] = np.nan

nosuv = (df_app3g['vehicletype'] == 'suv') & (df_app3g['model'].isin(['beetle','combo','transporter','vectra', 'verso','500','vito','a3','vivaro','a4','transit','m_reihe','astra','b_klasse','slk','corolla','corsa','doblo','r19','fabia','focus','picanto', 'omega', '147']))
df_app3g.loc[nosuv,['vehicletype']] = np.nan

noconvertible = (df_app3g['vehicletype'] == 'convertible') & (df_app3g['model'].isin(['ypsilon','100','passat','200','7er','90','a_klasse','antara','c2','calibra','forester','galaxy','glk','i3','kuga','nubira','zafira']))
df_app3g.loc[noconvertible,['vehicletype']] = np.nan

nocoupe = (df_app3g['vehicletype'] == 'coupe') & (df_app3g['model'].isin(['micra','aygo','9000','v70','a1','arosa','toledo','bora','ptcruiser','cx_reihe','seicento','getz','meriva','zafira']))
df_app3g.loc[nocoupe,['vehicletype']] = np.nan

nobus = (df_app3g['vehicletype'] == 'bus') & (df_app3g['model'].isin(['c5','civic','mondeo','astra','tucson','antara','a4','5er','4_reihe','x_trail','a6','sl','tigra','swift','micra','santa','forester','galant','justy','punto','panda','pajero','outlander','omega','m_klasse','mx_reihe','materia','lancer']))
df_app3g.loc[nobus,['vehicletype']] = np.nan

nowagon = (df_app3g['vehicletype'] == 'wagon') & (df_app3g['model'].isin(['jazz','calibra','200','getz','twingo','yeti','g_klasse','fox','arosa','clk','i3','musa','touareg','lanos','micra','a2','90','q3','lupo','santa','kappa','kalos','sl','niva','spark','slk']))
df_app3g.loc[nowagon,['vehicletype']] = np.nan

nosedan = (df_app3g['vehicletype'] == 'sedan') & (df_app3g['model'].isin(['v50','galaxy','z_reihe','s_max','materia','forester','tucson','move','cayenne','spider','sorento','cx_reihe','antara','rav','combo','cr_reihe']))
df_app3g.loc[nosedan,['vehicletype']] = np.nan

nosmall = (df_app3g['vehicletype'] == 'small') & (df_app3g['model'].isin(['doblo','verso','vivaro','6_reihe','defender','kuga','croma','m_reihe','grand','cayenne','rangerover','a6','sportage','accord','octavia','impreza','s_type','s_klasse','rx_reihe']))
df_app3g.loc[nosmall,['vehicletype']] = np.nan

noaudi = (df_app3g['model'] == 'audi')
df_app3g.loc[noaudi,['model']] = np.nan

notrab = (df_app3g['brand'] == 'trabant') & (df_app3g['model'] == '601') & (df_app3g['vehicletype'].isin(['coupe','suv']))
df_app3g.loc[notrab,['vehicletype']] = np.nan

nokiacoupe = (df_app3g['brand'].isin(['kia'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[nokiacoupe,['vehicletype']] = np.nan

daewc = (df_app3g['brand'].isin(['daewoo'])) & (df_app3g['model'] == 'lanos') & (df_app3g['vehicletype'].isin(['coupe','wagon']))
df_app3g.loc[daewc,['vehicletype']] = np.nan

lanc = (df_app3g['brand'] == 'lancia')  & (df_app3g['model'].isin(['kappa','delta'])) & (df_app3g['vehicletype'] == 'coupe')
df_app3g.loc[lanc,['vehicletype']] = np.nan

alfa147 = (df_app3g['brand'] == 'alfa_romeo') & (df_app3g['model'] == '147') & ~(df_app3g['vehicletype'].isin(['small','other']))
df_app3g.loc[alfa147,['vehicletype']] = np.nan

rovernos = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['rangerover'])) & ~(df_app3g['vehicletype'].isin(['suv','other']))
df_app3g.loc[rovernos,['vehicletype']] = np.nan

ibizano = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['ibiza'])) & ~(df_app3g['vehicletype'].isin(['other','small','sedan']))
df_app3g.loc[ibizano,['vehicletype']] = np.nan

alteano = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['altea'])) & ~(df_app3g['vehicletype'].isin(['other','small']))
df_app3g.loc[alteano,['vehicletype']] = np.nan

focuscb = (df_app3g['brand'] == 'ford') & (df_app3g['model'] == 'focus')  & (df_app3g['vehicletype'].isin(['coupe','bus']))
df_app3g.loc[focuscb,['vehicletype']] = np.nan

ccw = (df_app3g['brand'] == 'chrysler') & (df_app3g['model'] == 'crossfire') & (df_app3g['vehicletype'] == 'wagon')
df_app3g.loc[ccw,['vehicletype']] = np.nan

slcs = (df_app3g['brand'] == 'seat') & (df_app3g['model'] == 'leon') & (df_app3g['vehicletype'].isin(['coupe','sedan']))
df_app3g.loc[slcs,['vehicletype']] = np.nan

mcb = (df_app3g['brand'] == 'mazda') & (df_app3g['model'] == '3_reihe') & (df_app3g['vehicletype'].isin(['coupe','bus']))
df_app3g.loc[mcb,['vehicletype']] = np.nan

calc = (df_app3g['brand'] == 'opel') & (df_app3g['model'] == 'calibra') & (df_app3g['vehicletype'] != 'coupe')
df_app3g.loc[calc,['vehicletype']] = np.nan

hicsb = (df_app3g['brand'] == 'hyundai') & (df_app3g['model'] == 'i_reihe') & (df_app3g['vehicletype'].isin(['coupe','suv','bus']))
df_app3g.loc[hicsb,['vehicletype']] = np.nan

f500 = (df_app3g['brand'] == 'fiat') & (df_app3g['model'] == '500') & ~(df_app3g['vehicletype'].isin(['small','convertible']))
df_app3g.loc[f500,['vehicletype']] = np.nan

fpun = (df_app3g['brand'] == 'fiat') & (df_app3g['model'] == 'punto')  & (df_app3g['vehicletype'].isin(['coupe','sedan']))
df_app3g.loc[fpun,['vehicletype']] = np.nan

daian = (df_app3g['brand'] == 'daihatsu') & (df_app3g['model'] == 'terios') & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[daian,['vehicletype']] = np.nan

ladan = (df_app3g['brand'] == 'lada') & (df_app3g['model'] == 'niva') & (df_app3g['vehicletype'].isin(['bus','sedan']))
df_app3g.loc[ladan,['vehicletype']] = np.nan

aq5 = (df_app3g['brand'] == 'audi') & (df_app3g['model'] == 'q5') & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[aq5,['vehicletype']] = np.nan

aq7 = (df_app3g['brand'] == 'audi') & (df_app3g['model'] == 'q7') & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[aq7,['vehicletype']] = np.nan

dd = (df_app3g['brand'] == 'dacia') & (df_app3g['model'] == 'duster') & (df_app3g['vehicletype'].isin(['bus','wagon']))
df_app3g.loc[dd,['vehicletype']] = np.nan

tr = (df_app3g['brand'] == 'toyota') & (df_app3g['model'] == 'rav') & (df_app3g['vehicletype'].isin(['small','convertible']))
df_app3g.loc[tr,['vehicletype']] = np.nan

vxc = (df_app3g['brand'] == 'volvo') & (df_app3g['model'] == 'xc_reihe') & (df_app3g['vehicletype'].isin(['wagon','sedan']))
df_app3g.loc[vxc,['vehicletype']] = np.nan

sandan = (df_app3g['brand'] == 'dacia') & (df_app3g['model'] == 'sandero') & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[sandan,['vehicletype']] = np.nan

sju = (df_app3g['brand'] == 'subaru') & (df_app3g['model'] == 'justy') & (df_app3g['vehicletype'].isin(['suv','sedan','wagon']))
df_app3g.loc[sju,['vehicletype']] = np.nan

lym = (df_app3g['brand'] == 'lancia') & (df_app3g['model'].isin(['ypsilon','musa'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[lym,['vehicletype']] = np.nan

dmat = (df_app3g['brand'] == 'daewoo') & (df_app3g['model'].isin(['matiz'])) & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[dmat,['vehicletype']] = np.nan

tay = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['yaris'])) & (df_app3g['vehicletype'].isin(['bus','wagon']))
df_app3g.loc[tay,['vehicletype']] = np.nan

tayr = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['aygo','auris'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[tayr,['vehicletype']] = np.nan

tus = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['corolla'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[tus,['vehicletype']] = np.nan

coops = (df_app3g['brand'] == 'mini') & (df_app3g['model'].isin(['cooper'])) & (df_app3g['vehicletype'].isin(['suv','wagon','bus']))
df_app3g.loc[coops,['vehicletype']] = np.nan

coops = (df_app3g['brand'] == 'mini') & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[coops,['vehicletype']] = np.nan

mone = (df_app3g['brand'] == 'mini') & (df_app3g['model'].isin(['one'])) & (df_app3g['vehicletype'].isin(['suv','sedan']))
df_app3g.loc[mone,['vehicletype']] = np.nan

clubmn = (df_app3g['brand'] == 'mini') & (df_app3g['model'].isin(['clubman'])) & (df_app3g['vehicletype'].isin(['coupe','sedan']))
df_app3g.loc[clubmn,['vehicletype']] = np.nan

suzsw = (df_app3g['brand'] == 'suzuki') & (df_app3g['model'].isin(['swift'])) & (df_app3g['vehicletype'].isin(['wagon','sedan']))
df_app3g.loc[suzsw,['vehicletype']] = np.nan

cit12 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c1','c2'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[cit12,['vehicletype']] = np.nan

cit4 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c4'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[cit4,['vehicletype']] = np.nan

cit3 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c3'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','bus']))
df_app3g.loc[cit3,['vehicletype']] = np.nan

kr = (df_app3g['brand'] == 'kia') & (df_app3g['model'].isin(['rio'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[kr,['vehicletype']] = np.nan

cs = (df_app3g['brand'] == 'chevrolet') & (df_app3g['model'].isin(['spark'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[cs,['vehicletype']] = np.nan

p2 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['2_reihe'])) & (df_app3g['vehicletype'].isin(['suv','bus']))
df_app3g.loc[p2,['vehicletype']] = np.nan

p1 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['1_reihe'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','convertible']))
df_app3g.loc[p1,['vehicletype']] = np.nan

p3 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['3_reihe'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[p3,['vehicletype']] = np.nan

hg = (df_app3g['brand'] == 'hyundai') & (df_app3g['model'].isin(['getz'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[hg,['vehicletype']] = np.nan

oc = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['corsa'])) & (df_app3g['vehicletype'].isin(['coupe','convertible']))
df_app3g.loc[oc,['vehicletype']] = np.nan

oa = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['agila'])) & (df_app3g['vehicletype'].isin(['bus','wagon','sedan']))
df_app3g.loc[oa,['vehicletype']] = np.nan

omer = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['meriva'])) & (df_app3g['vehicletype'].isin(['bus','suv','sedan']))
df_app3g.loc[omer,['vehicletype']] = np.nan

ok = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['kadett'])) & (df_app3g['vehicletype'].isin(['suv']))
df_app3g.loc[ok,['vehicletype']] = np.nan

oz = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['zafira'])) & (df_app3g['vehicletype'].isin(['suv','sedan']))
df_app3g.loc[oz,['vehicletype']] = np.nan

hj = (df_app3g['brand'] == 'honda') & (df_app3g['model'].isin(['jazz'])) & (df_app3g['vehicletype'].isin(['bus','coupe','sedan']))
df_app3g.loc[hj,['vehicletype']] = np.nan


mak = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['a_klasse'])) & (df_app3g['vehicletype'].isin(['bus','suv','wagon','coupe']))
df_app3g.loc[mak,['vehicletype']] = np.nan

mbk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['b_klasse'])) & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[mbk,['vehicletype']] = np.nan

mck = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['clk'])) & (df_app3g['vehicletype'].isin(['sedan','small','suv']))
df_app3g.loc[mck,['vehicletype']] = np.nan

msk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['sprinter'])) & (df_app3g['vehicletype'].isin(['sedan','small']))
df_app3g.loc[msk,['vehicletype']] = np.nan

mvk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['viano'])) & (df_app3g['vehicletype'].isin(['sedan','small']))
df_app3g.loc[mvk,['vehicletype']] = np.nan

mvtk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['vito'])) & (df_app3g['vehicletype'].isin(['small']))
df_app3g.loc[mvtk,['vehicletype']] = np.nan


nn = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['note'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[nn,['vehicletype']] = np.nan

ff = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['fiesta'])) & (df_app3g['vehicletype'].isin(['bus','convertible']))
df_app3g.loc[ff,['vehicletype']] = np.nan

fk = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['ka'])) & (df_app3g['vehicletype'].isin(['coupe','wagon','convertible']))
df_app3g.loc[fk,['vehicletype']] = np.nan

ffu = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['fusion'])) & (df_app3g['vehicletype'].isin(['wagon','bus']))
df_app3g.loc[ffu,['vehicletype']] = np.nan

ffo = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['focus'])) & (df_app3g['vehicletype'].isin(['bus','suv','coupe','convertible']))
df_app3g.loc[ffo,['vehicletype']] = np.nan

fe = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['escort'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[fe,['vehicletype']] = np.nan

fm = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['mondeo'])) & (df_app3g['vehicletype'].isin(['small','coupe']))
df_app3g.loc[fm,['vehicletype']] = np.nan

sf2 = (df_app3g['brand'] == 'smart') & (df_app3g['model'].isin(['fortwo'])) & (df_app3g['vehicletype'].isin(['bus','sedan']))
df_app3g.loc[sf2,['vehicletype']] = np.nan

sf4 = (df_app3g['brand'] == 'smart') & (df_app3g['model'].isin(['forfour'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','coupe','convertible']))
df_app3g.loc[sf4,['vehicletype']] = np.nan

sf4 = (df_app3g['brand'] == 'smart') & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[sf4,['vehicletype']] = np.nan

fsed = (df_app3g['brand'] == 'fiat') & (df_app3g['model'].isin(['panda','seicento'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[fsed,['vehicletype']] = np.nan

sleo = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['leon'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[sleo,['vehicletype']] = np.nan

sm = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['mii'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[sm,['vehicletype']] = np.nan

rc = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['clio'])) & (df_app3g['vehicletype'].isin(['coupe','bus']))
df_app3g.loc[rc,['vehicletype']] = np.nan

rt = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['twingo'])) & (df_app3g['vehicletype'].isin(['sedan','coupe','convertible']))
df_app3g.loc[rt,['vehicletype']] = np.nan

rm = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['modus'])) & (df_app3g['vehicletype'].isin(['sedan','bus','wagon']))
df_app3g.loc[rm,['vehicletype']] = np.nan

rme = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['megane'])) & (df_app3g['vehicletype'].isin(['suv','bus','small']))
df_app3g.loc[rme,['vehicletype']] = np.nan

rk = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['kangoo'])) & (df_app3g['vehicletype'].isin(['suv','sedan','small']))
df_app3g.loc[rk,['vehicletype']] = np.nan

skf = (df_app3g['brand'] == 'skoda') & (df_app3g['model'].isin(['fabia'])) & (df_app3g['vehicletype'].isin(['bus','convertible']))
df_app3g.loc[skf,['vehicletype']] = np.nan

skc = (df_app3g['brand'] == 'skoda') & (df_app3g['model'].isin(['citigo'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[skc,['vehicletype']] = np.nan

vwp = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['polo'])) & (df_app3g['vehicletype'].isin(['wagon','coupe','convertible']))
df_app3g.loc[vwp,['vehicletype']] = np.nan

vwu = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['up'])) & (df_app3g['vehicletype'].isin(['sedan','suv']))
df_app3g.loc[vwu,['vehicletype']] = np.nan


vwg = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['golf','passat'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[vwg,['vehicletype']] = np.nan


vwb = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['beetle'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[vwb,['vehicletype']] = np.nan

vwc = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['caddy'])) & (df_app3g['vehicletype'].isin(['small','suv','sedan','convertible']))
df_app3g.loc[vwc,['vehicletype']] = np.nan

vwf = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['fox'])) & (df_app3g['vehicletype'].isin(['coupe','convertible']))
df_app3g.loc[vwf,['vehicletype']] = np.nan

vwl = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['lupo'])) & (df_app3g['vehicletype'].isin(['coupe','convertible', 'bus','sedan']))
df_app3g.loc[vwl,['vehicletype']] = np.nan

vws = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['scirocco'])) & (df_app3g['vehicletype'].isin(['small','convertible','sedan']))
df_app3g.loc[vws,['vehicletype']] = np.nan

vwt = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['touran'])) & (df_app3g['vehicletype'].isin(['small','convertible','sedan','suv','wagon']))
df_app3g.loc[vwt,['vehicletype']] = np.nan

vwj = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['jetta'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[vwj,['vehicletype']] = np.nan

vwsh = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['sharan'])) & (df_app3g['vehicletype'].isin(['small','wagon','sedan','suv']))
df_app3g.loc[vwsh,['vehicletype']] = np.nan

vwtrans = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['transporter'])) & (df_app3g['vehicletype'].isin(['small','sedan','wagon']))
df_app3g.loc[vwtrans,['vehicletype']] = np.nan

bmwx = (df_app3g['brand'] == 'bmw') & (df_app3g['model'].isin(['x_reihe'])) & (df_app3g['vehicletype'].isin(['wagon','sedan','bus']))
df_app3g.loc[bmwx,['vehicletype']] = np.nan

b5 = (df_app3g['brand'] == 'bmw') & (df_app3g['model'].isin(['5er'])) & (df_app3g['vehicletype'].isin(['small','suv']))
df_app3g.loc[b5,['vehicletype']] = np.nan

b1 = (df_app3g['brand'] == 'bmw') & (df_app3g['model'].isin(['1er'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[b1,['vehicletype']] = np.nan


maz3 = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['3_reihe'])) & (df_app3g['vehicletype'].isin(['wagon','coupe','convertible']))
df_app3g.loc[maz3,['vehicletype']] = np.nan

maz6 = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['6_reihe'])) & (df_app3g['vehicletype'].isin(['coupe','convertible','bus','small']))
df_app3g.loc[maz6,['vehicletype']] = np.nan


mbck = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['c_klasse'])) & (df_app3g['vehicletype'].isin(['bus','small','other']))
df_app3g.loc[mbck,['vehicletype']] = np.nan

mbck = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'] == 'c_klasse') & (df_app3g['registrationyear'] == 2001) & (df_app3g['power'] == 122) & (df_app3g['fueltype'] == 'gasoline') & (df_app3g['mileage'] == 150000) & (df_app3g['price'] > 1799) & (df_app3g['price'] < 3501)
df_app3g.loc[mbck,['vehicletype']] = 'sedan'

mbek = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['e_klasse'])) & (df_app3g['vehicletype'].isin(['bus','small','suv']))
df_app3g.loc[mbek,['vehicletype']] = np.nan

mbsk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['s_klasse'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[mbsk,['vehicletype']] = np.nan

mbcs = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['cl','sl'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[mbcs,['vehicletype']] = np.nan

mbglk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['glk'])) & (df_app3g['vehicletype'].isin(['sedan','coupe']))
df_app3g.loc[mbglk,['vehicletype']] = np.nan

vwbor = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['bora'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[vwbor,['vehicletype']] = np.nan

aa4 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a4'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[aa4,['vehicletype']] = np.nan

aa6 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a6'])) & (df_app3g['vehicletype'].isin(['suv']))
df_app3g.loc[aa6,['vehicletype']] = np.nan

aa8 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a8'])) & (df_app3g['vehicletype'].isin(['small','wagon']))
df_app3g.loc[aa8,['vehicletype']] = np.nan

aa5 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a5'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[aa5,['vehicletype']] = np.nan

aa1 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a1','q3'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[aa1,['vehicletype']] = np.nan

fc = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['c_max'])) & (df_app3g['vehicletype'].isin(['sedan','bus','suv']))
df_app3g.loc[fc,['vehicletype']] = np.nan

fm = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['mustang'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[fm,['vehicletype']] = np.nan

ov = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['vectra'])) & (df_app3g['vehicletype'].isin(['small','bus','convertible']))
df_app3g.loc[ov,['vehicletype']] = np.nan

os = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['signum'])) & (df_app3g['vehicletype'].isin(['sedan','bus']))
df_app3g.loc[os,['vehicletype']] = np.nan

omega = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['omega'])) & (df_app3g['vehicletype'].isin(['small']))
df_app3g.loc[omega,['vehicletype']] = np.nan

p5 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['5_reihe'])) & (df_app3g['vehicletype'].isin(['coupe','small','convertible']))
df_app3g.loc[p5,['vehicletype']] = np.nan

p4 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['4_reihe'])) & (df_app3g['vehicletype'].isin(['suv','small']))
df_app3g.loc[p4,['vehicletype']] = np.nan

rlag = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['laguna'])) & (df_app3g['vehicletype'].isin(['coupe','small','convertible']))
df_app3g.loc[rlag,['vehicletype']] = np.nan

rsc = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['scenic'])) & (df_app3g['vehicletype'].isin(['suv','sedan','bus']))
df_app3g.loc[rsc,['vehicletype']] = np.nan

ml = (df_app3g['brand'] == 'mitsubishi') & (df_app3g['model'].isin(['lancer'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[ml,['vehicletype']] = np.nan

mco = (df_app3g['brand'] == 'mitsubishi') & (df_app3g['model'].isin(['colt'])) & (df_app3g['vehicletype'].isin(['suv','wagon','bus','sedan']))
df_app3g.loc[mco,['vehicletype']] = np.nan

mout = (df_app3g['brand'] == 'mitsubishi') & (df_app3g['model'].isin(['outlander'])) & (df_app3g['vehicletype'].isin(['wagon','sedan']))
df_app3g.loc[mout,['vehicletype']] = np.nan

cc5 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c5'])) & (df_app3g['vehicletype'].isin(['small','bus']))
df_app3g.loc[cc5,['vehicletype']] = np.nan

st = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['toledo'])) & (df_app3g['vehicletype'].isin(['small','bus']))
df_app3g.loc[st,['vehicletype']] = np.nan

tv = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['verso'])) & (df_app3g['vehicletype'].isin(['sedan','bus']))
df_app3g.loc[tv,['vehicletype']] = np.nan

ta = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['avensis'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[ta,['vehicletype']] = np.nan

vv40 = (df_app3g['brand'] == 'volvo') & (df_app3g['model'].isin(['v40'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[vv40,['vehicletype']] = np.nan

vcr = (df_app3g['brand'] == 'volvo') & (df_app3g['model'].isin(['c_reihe'])) & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[vcr,['vehicletype']] = np.nan

fbrav = (df_app3g['brand'] == 'fiat') & (df_app3g['model'].isin(['bravo'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','coupe']))
df_app3g.loc[fbrav,['vehicletype']] = np.nan

c300 = (df_app3g['brand'] == 'chrysler') & (df_app3g['model'].isin(['300c'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[c300,['vehicletype']] = np.nan

dand = (df_app3g['brand'] == 'dacia') & (df_app3g['model'].isin(['logan'])) & (df_app3g['vehicletype'].isin(['suv']))
df_app3g.loc[dand,['vehicletype']] = np.nan

land = (df_app3g['brand'] == 'lancia') & (df_app3g['model'].isin(['delta']))
df_app3g.loc[land,['vehicletype']] = 'other'

rdef = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['defender']))
df_app3g.loc[rdef,['vehicletype']] = 'suv'

jb = (df_app3g['brand'] == 'jeep') & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[jb,['vehicletype']] = np.nan

rdisc = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['discovery'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[rdisc,['vehicletype']] = np.nan

norover = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['defender','freelander','discovery','rangerover']))
df_app3g.loc[norover,['brand']] = 'land_rover'

lrfree = (df_app3g['brand'] == 'land_rover') & (df_app3g['model'].isin(['freelander'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[lrfree,['vehicletype']] = np.nan

nq = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['qashqai'])) & (df_app3g['vehicletype'].isin(['sedan','bus','wagon']))
df_app3g.loc[nq,['vehicletype']] = np.nan

nq = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['qashqai'])) & (df_app3g['vehicletype'].isna())
df_app3g.loc[nq,['vehicletype']] = 'suv'

nnav = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['navara'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[nnav,['vehicletype']] = np.nan

nnav = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['navara'])) & (df_app3g['vehicletype'].isna())
df_app3g.loc[nnav,['vehicletype']] = 'suv'

hcr = (df_app3g['brand'] == 'honda') & (df_app3g['model'].isin(['cr_reihe'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[hcr,['vehicletype']] = np.nan

mcon = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['5_reihe','cx_reihe','1_reihe'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[mcon,['vehicletype']] = np.nan

maz5 = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['5_reihe'])) & (df_app3g['vehicletype'].isin(['suv','wagon','sedan']))
df_app3g.loc[maz5,['vehicletype']] = np.nan

cit3 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c3'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[cit3,['vehicletype']] = np.nan

fvert = (df_app3g['brand'] == 'fiat') & (df_app3g['model'].isin(['punto','panda'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[fvert,['vehicletype']] = np.nan

zuvert = (df_app3g['brand'] == 'suzuki') & (df_app3g['model'].isin(['swift','grand'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[zuvert,['vehicletype']] = np.nan

mgk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['g_klasse'])) & (df_app3g['vehicletype'].isin(['convertible','sedan']))
df_app3g.loc[mgk,['vehicletype']] = np.nan

arsp = (df_app3g['brand'] == 'alfa_romeo') & (df_app3g['model'].isin(['spider'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[arsp,['vehicletype']] = np.nan

toua = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'] == 'tiguan') & (df_app3g['vehicletype'].isin(['sedan','bus']))
df_app3g.loc[toua,['vehicletype']] = np.nan

toua = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'] == 'tiguan') & (df_app3g['vehicletype'].isna())
df_app3g.loc[toua,['vehicletype']] = 'suv'

ptc = (df_app3g['brand'] == 'chrysler') & (df_app3g['model'] == 'ptcruiser') & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[ptc,['vehicletype']] = np.nan
In [147]:
print(df_app3g.memory_usage(deep=True).sum() / 1_000_000, "MB")
248.205326 MB
In [148]:
gc.collect()
Out[148]:
0
In [149]:
del toua
del jb
del rdisc
del norover
del lrfree
del nq
del nnav
del hcr
del mcon
del maz5
del fvert
del zuvert
del mgk
del arsp
del bmwx
del b5
del b1
del maz3
del maz6
del mbck
del mbek
del mbsk
del mbcs
del mbglk
del vwbor
del aa4
del aa6
del aa8
del aa5
del aa1
del fc
del fm
del ov
del os
del omega
del p5
del p4
del rlag
del rsc
del ml
del mco
del mout
del cc5
del st
del tv
del ta
del vv40
del vcr
del fbrav
del c300
del dand
del land
del rdef
del fsed
del sleo
del sm
del rc
del rt
del rm
del rme
del rk
del skf
del skc
del vwp
del vwu
del vwg
del vwb
del vwc
del vwf
del vwl
del vws
del vwt
del vwj
del vwsh
del vwtrans
del kr
del cs
del p2
del p1
del p3
del hg
del oc
del oa
del omer
del ok
del oz
del hj
del mak
del mbk
del mck
del msk
del mvk
del mvtk
del nn
del ff
del fk
del ffu
del ffo
del fe
del sf2
del sf4
del cit4
del cit3
del cit12
del suzsw
del clubmn
del mone
del coops
del tus
del tayr
del tay
del dmat
del lym
del sju
del sandan
del vxc
del tr
del dd
del aq7
del aq5
del ladan
del daian
del fpun
del f500
del hicsb
del calc
del mcb
del slcs
del ccw
del focuscb
del alteano
del ibizano
del rovernos
del alfa147
del lanc
del daewc
del nokiacoupe
del notrab

gc.collect()
Out[149]:
0
In [150]:
def fill_all_missing_values(df, repeat_until_change=True, threshold=0.7, verbose=True, max_iterations=10):
    """
    Fill missing values for power, vehicletype, model, fueltype using tiered group strategies.
    Optimized version with better memory management and early stopping.
    """
    df = df.copy()

    def safe_mode(series):
        """Return mode if confident enough (>= threshold), else NaN."""
        s = series.dropna()
        if len(s) == 0:
            return np.nan
        counts = s.value_counts(normalize=True)
        if len(counts) == 0:
            return np.nan
        top_val, top_freq = counts.index[0], counts.iloc[0]
        return top_val if top_freq >= threshold else np.nan

    def is_zero_condition(condition):
        """Heuristic to detect if condition checks for zero (lambda x: x == 0)."""
        try:
            test = condition(pd.Series([0, np.nan], dtype=object))
            if isinstance(test, (bool, np.bool_)) and test:
                return True
            if hasattr(test, "__len__") and len(test) >= 1:
                return bool(test.iloc[0] if hasattr(test, 'iloc') else test[0])
        except Exception:
            pass
        return False

    def make_key_tuple(row_vals):
        """Helper: convert list-like row values to a hashable tuple with None for NaN."""
        return tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row_vals)

    def fill_column(target_col, fill_strategies, condition=lambda x: x.isna()):
        total_filled = 0
        zero_check = is_zero_condition(condition)

        # Track initial state
        if zero_check:
            initial_missing = (df[target_col] == 0).sum()
        else:
            initial_missing = df[target_col].isna().sum()
        
        if initial_missing == 0:
            return 0
        
        if verbose:
            print(f"  → Starting with {initial_missing:,} missing values in '{target_col}'")

        for cols in fill_strategies:
            # Check if there's still work to do
            if zero_check:
                current_missing = (df[target_col] == 0).sum()
            else:
                current_missing = df[target_col].isna().sum()
            
            if current_missing == 0:
                break
            
            start_time = time.time()

            try:
                # Compute group modes using safe_mode
                group_modes = (
                    df.groupby(cols, dropna=False)[target_col]
                    .apply(safe_mode)
                    .reset_index()
                    .rename(columns={target_col: 'fill_value'})
                )
                
                # Remove groups with no valid fill value
                group_modes = group_modes[group_modes['fill_value'].notna()]
                
                if len(group_modes) == 0:
                    continue

            except Exception as e:
                if verbose:
                    print(f"⚠️ Skipping strategy {cols} — groupby failed: {str(e)[:50]}")
                continue

            # Build mapping dict from group_modes
            keys = group_modes[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            mapping = dict(zip(keys, group_modes['fill_value'].values))

            # Compute fill_value per-row by mapping (keeps original row order)
            row_keys = df[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            fill_series = row_keys.map(mapping)

            # Create mask of rows that need filling AND have a candidate fill_value
            mask_need = condition(df[target_col])
            mask_candidate = fill_series.notna()
            mask = mask_need & mask_candidate

            # Count before
            if zero_check:
                before_missing = (df[target_col] == 0).sum()
            else:
                before_missing = df[target_col].isna().sum()

            # Perform fill
            if mask.any():
                df.loc[mask, target_col] = fill_series.loc[mask].values

            # Count after
            if zero_check:
                after_missing = (df[target_col] == 0).sum()
            else:
                after_missing = df[target_col].isna().sum()

            filled_now = before_missing - after_missing
            total_filled += int(filled_now)

            if verbose and filled_now > 0:
                elapsed = time.time() - start_time
                print(f"✅ Filled {int(filled_now):,} values in '{target_col}' using {cols} ({after_missing:,} remaining, took {elapsed:.2f}s)")

        return total_filled

    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        total_filled = 0
        if verbose:
            print(f"\n🌀 Iteration {iteration} starting...")

        # --- POWER ---
        power_strategies = [
            ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model'],
            ['brand', 'vehicletype'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('power', power_strategies, condition=lambda x: x == 0)

        # --- VEHICLE TYPE ---
        vehicletype_strategies = [
            ['brand', 'model', 'power', 'year_bin'],
            ['brand', 'model', 'power', 'registrationyear'],
            ['brand', 'model', 'power', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model', 'power'],
            ['brand', 'model', 'gearbox'],
            ['brand', 'model'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'power'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('vehicletype', vehicletype_strategies)

        # --- MODEL ---
        model_strategies = [
            ['brand', 'vehicletype', 'power', 'year_bin'],
            ['brand', 'vehicletype', 'power', 'registrationyear'],
            ['brand', 'vehicletype', 'power', 'gearbox'],
            ['brand', 'vehicletype', 'year_bin'],
            ['brand', 'vehicletype', 'registrationyear'],
            ['brand', 'vehicletype', 'power'],
            ['brand', 'vehicletype', 'gearbox'],
            ['brand', 'vehicletype'],
            ['brand', 'power'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('model', model_strategies)

        # --- FUELTYPE ---
        fueltype_strategies = [
            ['brand', 'model', 'vehicletype', 'power', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'power', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'power', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'power', 'year_bin'],
            ['brand', 'model', 'power', 'registrationyear'],
            ['brand', 'model', 'power', 'gearbox'],
            ['brand', 'model', 'power'],
            ['brand', 'model'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('fueltype', fueltype_strategies)

        if verbose:
            print(f"🔁 Iteration {iteration} filled {total_filled:,} total values")

        if not repeat_until_change or total_filled == 0:
            if verbose:
                print("🏁 No further changes detected, stopping.")
            break

    return df
In [151]:
df_app3g = df_app3g[df_app3g['price'] > 99].copy()
In [152]:
gc.collect()
Out[152]:
0
In [153]:
df_app3g = fill_gearbox(df_app3g)
Filled 19 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥90% majority rule)

✅ Gearbox filling complete: 19 filled, 5525 still missing.
In [154]:
df_app = fill_all_missing_values(df_app3g, threshold = 0.75)
🌀 Iteration 1 starting...
  → Starting with 31,620 missing values in 'power'
✅ Filled 101 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (31,519 remaining, took 5.41s)
✅ Filled 91 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (31,428 remaining, took 12.76s)
✅ Filled 72 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (31,356 remaining, took 4.95s)
✅ Filled 145 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (31,211 remaining, took 3.74s)
✅ Filled 101 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (31,110 remaining, took 8.65s)
✅ Filled 62 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (31,048 remaining, took 3.55s)
✅ Filled 45 values in 'power' using ['brand', 'model', 'vehicletype'] (31,003 remaining, took 2.90s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (30,992 remaining, took 4.14s)
✅ Filled 47 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (30,945 remaining, took 3.78s)
✅ Filled 93 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (30,852 remaining, took 8.22s)
✅ Filled 26 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (30,826 remaining, took 3.49s)
✅ Filled 23 values in 'power' using ['brand', 'model', 'year_bin'] (30,803 remaining, took 2.72s)
✅ Filled 116 values in 'power' using ['brand', 'model', 'registrationyear'] (30,687 remaining, took 5.28s)
✅ Filled 22 values in 'power' using ['brand', 'model'] (30,665 remaining, took 2.37s)
✅ Filled 16 values in 'power' using ['brand', 'vehicletype'] (30,649 remaining, took 2.45s)
✅ Filled 19 values in 'power' using ['brand', 'year_bin'] (30,630 remaining, took 2.28s)
✅ Filled 1 values in 'power' using ['brand', 'registrationyear'] (30,629 remaining, took 3.41s)
✅ Filled 14 values in 'power' using ['brand', 'gearbox'] (30,615 remaining, took 2.33s)
  → Starting with 26,591 missing values in 'vehicletype'
✅ Filled 10,146 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (16,445 remaining, took 10.68s)
✅ Filled 2,283 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (14,162 remaining, took 21.27s)
✅ Filled 3,820 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (10,342 remaining, took 9.68s)
✅ Filled 2,228 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (8,114 remaining, took 2.87s)
✅ Filled 1,685 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (6,429 remaining, took 5.63s)
✅ Filled 282 values in 'vehicletype' using ['brand', 'model', 'power'] (6,147 remaining, took 8.02s)
✅ Filled 517 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (5,630 remaining, took 2.80s)
✅ Filled 72 values in 'vehicletype' using ['brand', 'model'] (5,558 remaining, took 2.56s)
✅ Filled 5 values in 'vehicletype' using ['brand', 'year_bin'] (5,553 remaining, took 2.35s)
✅ Filled 138 values in 'vehicletype' using ['brand', 'registrationyear'] (5,415 remaining, took 3.59s)
✅ Filled 454 values in 'vehicletype' using ['brand', 'power'] (4,961 remaining, took 4.88s)
✅ Filled 6 values in 'vehicletype' using ['brand', 'gearbox'] (4,955 remaining, took 2.41s)
  → Starting with 11,269 missing values in 'model'
✅ Filled 2,651 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (8,618 remaining, took 11.07s)
✅ Filled 1,606 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (7,012 remaining, took 22.50s)
✅ Filled 765 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (6,247 remaining, took 9.64s)
✅ Filled 248 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (5,999 remaining, took 2.83s)
✅ Filled 342 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (5,657 remaining, took 5.69s)
✅ Filled 116 values in 'model' using ['brand', 'vehicletype', 'power'] (5,541 remaining, took 7.97s)
✅ Filled 56 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (5,485 remaining, took 2.77s)
✅ Filled 92 values in 'model' using ['brand', 'power'] (5,393 remaining, took 4.72s)
✅ Filled 6 values in 'model' using ['brand', 'year_bin'] (5,387 remaining, took 2.40s)
✅ Filled 38 values in 'model' using ['brand', 'registrationyear'] (5,349 remaining, took 3.58s)
  → Starting with 11,647 missing values in 'fueltype'
✅ Filled 6,346 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (5,301 remaining, took 14.24s)
✅ Filled 998 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'registrationyear'] (4,303 remaining, took 27.06s)
✅ Filled 1,418 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'gearbox'] (2,885 remaining, took 13.31s)
✅ Filled 1,299 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'year_bin'] (1,586 remaining, took 3.76s)
✅ Filled 288 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (1,298 remaining, took 8.66s)
✅ Filled 122 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'gearbox'] (1,176 remaining, took 3.55s)
✅ Filled 150 values in 'fueltype' using ['brand', 'model', 'power', 'year_bin'] (1,026 remaining, took 10.84s)
✅ Filled 134 values in 'fueltype' using ['brand', 'model', 'power', 'registrationyear'] (892 remaining, took 21.79s)
✅ Filled 105 values in 'fueltype' using ['brand', 'model', 'power', 'gearbox'] (787 remaining, took 9.74s)
✅ Filled 75 values in 'fueltype' using ['brand', 'model', 'power'] (712 remaining, took 7.89s)
✅ Filled 319 values in 'fueltype' using ['brand', 'model'] (393 remaining, took 2.46s)
✅ Filled 19 values in 'fueltype' using ['brand', 'year_bin'] (374 remaining, took 2.36s)
✅ Filled 26 values in 'fueltype' using ['brand', 'registrationyear'] (348 remaining, took 3.63s)
✅ Filled 9 values in 'fueltype' using ['brand', 'gearbox'] (339 remaining, took 2.43s)
🔁 Iteration 1 filled 39,869 total values

🌀 Iteration 2 starting...
  → Starting with 30,615 missing values in 'power'
✅ Filled 26 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (30,589 remaining, took 4.61s)
✅ Filled 63 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (30,526 remaining, took 10.83s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (30,516 remaining, took 4.34s)
✅ Filled 205 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (30,311 remaining, took 3.61s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (30,300 remaining, took 7.97s)
✅ Filled 765 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (29,535 remaining, took 3.34s)
✅ Filled 405 values in 'power' using ['brand', 'model', 'vehicletype'] (29,130 remaining, took 2.83s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (29,129 remaining, took 3.48s)
✅ Filled 20 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (29,109 remaining, took 3.37s)
✅ Filled 14 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (29,095 remaining, took 7.17s)
✅ Filled 38 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (29,057 remaining, took 3.18s)
✅ Filled 247 values in 'power' using ['brand', 'model', 'year_bin'] (28,810 remaining, took 2.68s)
✅ Filled 114 values in 'power' using ['brand', 'model', 'registrationyear'] (28,696 remaining, took 5.20s)
✅ Filled 423 values in 'power' using ['brand', 'model'] (28,273 remaining, took 2.38s)
✅ Filled 1 values in 'power' using ['brand', 'vehicletype'] (28,272 remaining, took 2.41s)
  → Starting with 4,955 missing values in 'vehicletype'
✅ Filled 590 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (4,365 remaining, took 10.93s)
✅ Filled 277 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (4,088 remaining, took 21.86s)
✅ Filled 169 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (3,919 remaining, took 9.64s)
✅ Filled 296 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (3,623 remaining, took 2.79s)
✅ Filled 64 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (3,559 remaining, took 5.61s)
✅ Filled 38 values in 'vehicletype' using ['brand', 'model', 'power'] (3,521 remaining, took 8.03s)
✅ Filled 224 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (3,297 remaining, took 2.76s)
✅ Filled 17 values in 'vehicletype' using ['brand', 'model'] (3,280 remaining, took 2.46s)
✅ Filled 1 values in 'vehicletype' using ['brand', 'power'] (3,279 remaining, took 4.84s)
  → Starting with 5,349 missing values in 'model'
✅ Filled 128 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (5,221 remaining, took 11.08s)
✅ Filled 28 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (5,193 remaining, took 22.45s)
✅ Filled 20 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (5,173 remaining, took 9.89s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (5,172 remaining, took 5.76s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power'] (5,169 remaining, took 7.97s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (5,168 remaining, took 2.68s)
✅ Filled 2 values in 'model' using ['brand', 'vehicletype'] (5,166 remaining, took 2.47s)
✅ Filled 1 values in 'model' using ['brand', 'power'] (5,165 remaining, took 4.88s)
  → Starting with 339 missing values in 'fueltype'
✅ Filled 10 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'gearbox'] (329 remaining, took 13.54s)
✅ Filled 1 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (328 remaining, took 8.69s)
🔁 Iteration 2 filled 4,214 total values

🌀 Iteration 3 starting...
  → Starting with 28,272 missing values in 'power'
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (28,271 remaining, took 4.52s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (28,270 remaining, took 10.85s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (28,269 remaining, took 4.19s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (28,268 remaining, took 3.50s)
✅ Filled 62 values in 'power' using ['brand', 'model', 'vehicletype'] (28,206 remaining, took 2.80s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (28,195 remaining, took 3.28s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (28,182 remaining, took 3.23s)
✅ Filled 645 values in 'power' using ['brand', 'model', 'year_bin'] (27,537 remaining, took 2.70s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'registrationyear'] (27,524 remaining, took 5.20s)
  → Starting with 3,279 missing values in 'vehicletype'
✅ Filled 108 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (3,171 remaining, took 10.96s)
✅ Filled 138 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (3,033 remaining, took 21.72s)
✅ Filled 7 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (3,026 remaining, took 9.65s)
✅ Filled 108 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (2,918 remaining, took 2.82s)
✅ Filled 24 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (2,894 remaining, took 5.59s)
✅ Filled 2 values in 'vehicletype' using ['brand', 'model', 'power'] (2,892 remaining, took 7.90s)
✅ Filled 55 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (2,837 remaining, took 2.87s)
  → Starting with 5,165 missing values in 'model'
✅ Filled 41 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (5,124 remaining, took 11.03s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (5,123 remaining, took 22.58s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (5,122 remaining, took 9.63s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (5,119 remaining, took 5.75s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power'] (5,116 remaining, took 7.90s)
  → Starting with 328 missing values in 'fueltype'
✅ Filled 4 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (324 remaining, took 14.74s)
✅ Filled 1 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'registrationyear'] (323 remaining, took 27.65s)
🔁 Iteration 3 filled 1,244 total values

🌀 Iteration 4 starting...
  → Starting with 27,524 missing values in 'power'
✅ Filled 4 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (27,520 remaining, took 4.22s)
✅ Filled 94 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (27,426 remaining, took 3.51s)
  → Starting with 2,837 missing values in 'vehicletype'
✅ Filled 9 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (2,828 remaining, took 11.19s)
✅ Filled 17 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (2,811 remaining, took 22.20s)
✅ Filled 333 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (2,478 remaining, took 2.95s)
  → Starting with 5,116 missing values in 'model'
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (5,114 remaining, took 11.32s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (5,111 remaining, took 23.08s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (5,108 remaining, took 10.25s)
  → Starting with 323 missing values in 'fueltype'
✅ Filled 1 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (322 remaining, took 9.03s)
🔁 Iteration 4 filled 466 total values

🌀 Iteration 5 starting...
  → Starting with 27,426 missing values in 'power'
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (27,425 remaining, took 4.53s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'registrationyear'] (27,412 remaining, took 5.68s)
  → Starting with 2,478 missing values in 'vehicletype'
  → Starting with 5,108 missing values in 'model'
  → Starting with 322 missing values in 'fueltype'
🔁 Iteration 5 filled 14 total values

🌀 Iteration 6 starting...
  → Starting with 27,412 missing values in 'power'
  → Starting with 2,478 missing values in 'vehicletype'
  → Starting with 5,108 missing values in 'model'
  → Starting with 322 missing values in 'fueltype'
🔁 Iteration 6 filled 0 total values
🏁 No further changes detected, stopping.
In [155]:
mask = (df_app['brand'] == 'citroen') & (df_app['model'] == 'c4') & (df_app['vehicletype'] == 'bus')
df_app.loc[mask,['vehicletype']] = 'small'
In [156]:
mask = (df_app['brand'] == 'renault') & (df_app['model'] == 'megane') & (df_app['vehicletype'] == 'bus')
df_app.loc[mask,['vehicletype']] = 'small'
In [157]:
mask = (df_app['brand'] == 'ford') & (df_app['model'] == 'fusion') & (df_app['vehicletype'] == 'bus')
df_app.loc[mask,['vehicletype']] = 'small'
In [158]:
mask = (df_app['brand'] == 'seat') & (df_app['model'] == 'leon') & (df_app['vehicletype'] == 'bus')
df_app.loc[mask,['vehicletype']] = 'small'
In [159]:
del mask
In [160]:
print(df_app.memory_usage(deep=True).sum() / 1_000_000, "MB")
240.465945 MB
In [161]:
del df_app3g
In [162]:
df_app1 = fill_gearbox(df_app, threshold = 0.75)
Filled 2775 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥75% majority rule)
Filled 456 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥75% majority rule)
Filled 52 missing gearbox values using ['brand', 'model', 'fueltype'] (≥75% majority rule)
Filled 561 missing gearbox values using ['brand', 'model'] (≥75% majority rule)
Filled 57 missing gearbox values using ['brand'] (≥75% majority rule)

✅ Gearbox filling complete: 3901 filled, 1624 still missing.
In [163]:
df_app1['pc_bin'] = df_app1['postalcode'].astype(str).str[0]

df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','fueltype','year_bin','pc_bin'], threshold = 0.75)
display(df_app1[df_app1['power'] == 0])

df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','fueltype','registrationyear','pc_bin'], threshold = 0.75)
df_app1[df_app1['power'] == 0]

df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','fueltype','gearbox','pc_bin'], threshold = 0.75)
df_app1[df_app1['power'] == 0]

df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','fueltype','pc_bin'], threshold = 0.75)
df_app1[df_app1['power'] == 0]

df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','pc_bin'], threshold = 0.75)
df_app1[df_app1['power'] == 0]
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
24 15/03/2016 20:59 245 sedan 1994.0 manual 0.0 golf 150000 2 petrol volkswagen no 2016-03-15 0 44145 17/03/2016 18:17 N 1990s 4
40 17/03/2016 07:56 4700 wagon 2005.0 manual 0.0 signum 150000 0 gasoline opel no 2016-03-17 0 88433 04/04/2016 04:17 N 2000s 8
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
53 08/03/2016 01:36 800 small 1993.0 manual 0.0 polo 150000 3 petrol volkswagen no 2016-08-03 0 8258 05/04/2016 23:46 N 1990s 8
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340778 14/03/2016 12:37 2800 wagon 2013.0 manual 0.0 passat 150000 0 gasoline volkswagen NaN 2016-03-14 0 45892 19/03/2016 23:46 N 2010_plus 4
340779 14/03/2016 22:37 5500 wagon 2013.0 auto 0.0 passat 150000 1 gasoline volkswagen no 2016-03-14 0 90441 15/03/2016 19:47 N 2010_plus 9
340784 15/03/2016 11:45 850 other 2013.0 manual 0.0 other 5000 0 petrol audi no 2016-03-15 0 86647 16/03/2016 07:17 N 2010_plus 8
340785 10/03/2016 19:42 1850 convertible 2013.0 auto 0.0 megane 150000 5 petrol renault no 2016-10-03 0 27432 06/04/2016 02:17 N 2010_plus 2
340793 07/04/2016 08:36 1670 convertible 2013.0 manual 0.0 megane 90000 0 petrol renault no 2016-07-04 0 12167 07/04/2016 08:36 N 2010_plus 1

25888 rows × 19 columns

Out[163]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
24 15/03/2016 20:59 245 sedan 1994.0 manual 0.0 golf 150000 2 petrol volkswagen no 2016-03-15 0 44145 17/03/2016 18:17 N 1990s 4
40 17/03/2016 07:56 4700 wagon 2005.0 manual 0.0 signum 150000 0 gasoline opel no 2016-03-17 0 88433 04/04/2016 04:17 N 2000s 8
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
53 08/03/2016 01:36 800 small 1993.0 manual 0.0 polo 150000 3 petrol volkswagen no 2016-08-03 0 8258 05/04/2016 23:46 N 1990s 8
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340772 15/03/2016 08:51 1300 sedan 2013.0 NaN 0.0 5er 150000 0 petrol bmw yes 2016-03-15 0 66130 27/03/2016 19:46 N 2010_plus 6
340773 08/03/2016 21:06 3400 wagon 2013.0 manual 0.0 passat 5000 10 gasoline volkswagen no 2016-08-03 0 35435 15/03/2016 11:45 N 2010_plus 3
340774 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8
340779 14/03/2016 22:37 5500 wagon 2013.0 auto 0.0 passat 150000 1 gasoline volkswagen no 2016-03-14 0 90441 15/03/2016 19:47 N 2010_plus 9
340784 15/03/2016 11:45 850 other 2013.0 manual 0.0 other 5000 0 petrol audi no 2016-03-15 0 86647 16/03/2016 07:17 N 2010_plus 8

22256 rows × 19 columns

In [164]:
gc.collect()
Out[164]:
0
In [165]:
df_app1 = fill_gearbox(df_app1, threshold = 0.6)
Filled 433 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥60% majority rule)
Filled 173 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥60% majority rule)
Filled 46 missing gearbox values using ['brand', 'model', 'fueltype'] (≥60% majority rule)
Filled 43 missing gearbox values using ['brand', 'model'] (≥60% majority rule)
Filled 165 missing gearbox values using ['brand'] (≥60% majority rule)

✅ Gearbox filling complete: 860 filled, 764 still missing.
In [166]:
gc.collect()
Out[166]:
0
In [167]:
df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','fueltype','year_bin','pc_bin'], threshold = 0.6)
display(df_app1[df_app1['power'] == 0])

df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','fueltype','registrationyear','pc_bin'], threshold = 0.6)
df_app1[df_app1['power'] == 0]

df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','fueltype','gearbox','pc_bin'], threshold = 0.6)
df_app1[df_app1['power'] == 0]

df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','fueltype','pc_bin'], threshold = 0.6)
df_app1[df_app1['power'] == 0]

df_app1 = fill_zero_power(df_app1,group_cols = ['brand','model','vehicletype','pc_bin'], threshold = 0.6)
df_app1[df_app1['power'] == 0]
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
24 15/03/2016 20:59 245 sedan 1994.0 manual 0.0 golf 150000 2 petrol volkswagen no 2016-03-15 0 44145 17/03/2016 18:17 N 1990s 4
40 17/03/2016 07:56 4700 wagon 2005.0 manual 0.0 signum 150000 0 gasoline opel no 2016-03-17 0 88433 04/04/2016 04:17 N 2000s 8
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
53 08/03/2016 01:36 800 small 1993.0 manual 0.0 polo 150000 3 petrol volkswagen no 2016-08-03 0 8258 05/04/2016 23:46 N 1990s 8
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340772 15/03/2016 08:51 1300 sedan 2013.0 manual 0.0 5er 150000 0 petrol bmw yes 2016-03-15 0 66130 27/03/2016 19:46 N 2010_plus 6
340773 08/03/2016 21:06 3400 wagon 2013.0 manual 0.0 passat 5000 10 gasoline volkswagen no 2016-08-03 0 35435 15/03/2016 11:45 N 2010_plus 3
340774 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8
340779 14/03/2016 22:37 5500 wagon 2013.0 auto 0.0 passat 150000 1 gasoline volkswagen no 2016-03-14 0 90441 15/03/2016 19:47 N 2010_plus 9
340784 15/03/2016 11:45 850 other 2013.0 manual 0.0 other 5000 0 petrol audi no 2016-03-15 0 86647 16/03/2016 07:17 N 2010_plus 8

20858 rows × 19 columns

Out[167]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
24 15/03/2016 20:59 245 sedan 1994.0 manual 0.0 golf 150000 2 petrol volkswagen no 2016-03-15 0 44145 17/03/2016 18:17 N 1990s 4
40 17/03/2016 07:56 4700 wagon 2005.0 manual 0.0 signum 150000 0 gasoline opel no 2016-03-17 0 88433 04/04/2016 04:17 N 2000s 8
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
85 03/04/2016 03:57 350 small 1998.0 manual 0.0 corsa 150000 2 petrol opel NaN 2016-03-04 0 82110 03/04/2016 08:53 N 1990s 8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340770 09/03/2016 21:00 1150 small 2013.0 auto 0.0 fortwo 150000 11 petrol smart no 2016-09-03 0 47443 10/03/2016 07:46 N 2010_plus 4
340773 08/03/2016 21:06 3400 wagon 2013.0 manual 0.0 passat 5000 10 gasoline volkswagen no 2016-08-03 0 35435 15/03/2016 11:45 N 2010_plus 3
340774 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8
340779 14/03/2016 22:37 5500 wagon 2013.0 auto 0.0 passat 150000 1 gasoline volkswagen no 2016-03-14 0 90441 15/03/2016 19:47 N 2010_plus 9
340784 15/03/2016 11:45 850 other 2013.0 manual 0.0 other 5000 0 petrol audi no 2016-03-15 0 86647 16/03/2016 07:17 N 2010_plus 8

17893 rows × 19 columns

In [168]:
gc.collect()
Out[168]:
0
In [169]:
df_app2 = fill_all_missing_values(df_app1, threshold = 0.6)
🌀 Iteration 1 starting...
  → Starting with 17,893 missing values in 'power'
✅ Filled 161 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (17,732 remaining, took 4.59s)
✅ Filled 438 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (17,294 remaining, took 10.90s)
✅ Filled 110 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (17,184 remaining, took 4.13s)
✅ Filled 117 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (17,067 remaining, took 3.60s)
✅ Filled 73 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (16,994 remaining, took 8.12s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (16,983 remaining, took 3.30s)
✅ Filled 181 values in 'power' using ['brand', 'model', 'vehicletype'] (16,802 remaining, took 2.93s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (16,800 remaining, took 3.56s)
✅ Filled 57 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (16,743 remaining, took 3.52s)
✅ Filled 199 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (16,544 remaining, took 7.21s)
✅ Filled 61 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (16,483 remaining, took 3.12s)
✅ Filled 519 values in 'power' using ['brand', 'model', 'year_bin'] (15,964 remaining, took 2.84s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'registrationyear'] (15,954 remaining, took 5.27s)
✅ Filled 208 values in 'power' using ['brand', 'model'] (15,746 remaining, took 2.42s)
✅ Filled 27 values in 'power' using ['brand', 'vehicletype'] (15,719 remaining, took 2.46s)
✅ Filled 2 values in 'power' using ['brand', 'year_bin'] (15,717 remaining, took 2.34s)
✅ Filled 3 values in 'power' using ['brand', 'registrationyear'] (15,714 remaining, took 3.54s)
  → Starting with 2,478 missing values in 'vehicletype'
✅ Filled 387 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (2,091 remaining, took 11.01s)
✅ Filled 471 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (1,620 remaining, took 21.73s)
✅ Filled 299 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (1,321 remaining, took 9.49s)
✅ Filled 265 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (1,056 remaining, took 2.88s)
✅ Filled 127 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (929 remaining, took 5.74s)
✅ Filled 35 values in 'vehicletype' using ['brand', 'model', 'power'] (894 remaining, took 8.09s)
✅ Filled 94 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (800 remaining, took 2.74s)
✅ Filled 29 values in 'vehicletype' using ['brand', 'model'] (771 remaining, took 2.52s)
✅ Filled 13 values in 'vehicletype' using ['brand', 'year_bin'] (758 remaining, took 2.49s)
✅ Filled 50 values in 'vehicletype' using ['brand', 'registrationyear'] (708 remaining, took 3.71s)
✅ Filled 22 values in 'vehicletype' using ['brand', 'power'] (686 remaining, took 5.05s)
  → Starting with 5,108 missing values in 'model'
✅ Filled 884 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (4,224 remaining, took 11.15s)
✅ Filled 584 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (3,640 remaining, took 22.49s)
✅ Filled 211 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (3,429 remaining, took 9.46s)
✅ Filled 71 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (3,358 remaining, took 2.89s)
✅ Filled 88 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (3,270 remaining, took 5.82s)
✅ Filled 13 values in 'model' using ['brand', 'vehicletype', 'power'] (3,257 remaining, took 7.95s)
✅ Filled 70 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (3,187 remaining, took 2.74s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype'] (3,186 remaining, took 2.60s)
✅ Filled 21 values in 'model' using ['brand', 'power'] (3,165 remaining, took 4.93s)
✅ Filled 1 values in 'model' using ['brand', 'registrationyear'] (3,164 remaining, took 3.68s)
✅ Filled 13 values in 'model' using ['brand', 'gearbox'] (3,151 remaining, took 2.42s)
  → Starting with 322 missing values in 'fueltype'
✅ Filled 234 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (88 remaining, took 14.43s)
✅ Filled 23 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'registrationyear'] (65 remaining, took 27.04s)
✅ Filled 12 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'gearbox'] (53 remaining, took 13.08s)
✅ Filled 29 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'year_bin'] (24 remaining, took 3.77s)
✅ Filled 13 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (11 remaining, took 8.47s)
✅ Filled 4 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'gearbox'] (7 remaining, took 3.44s)
✅ Filled 2 values in 'fueltype' using ['brand', 'model', 'power', 'registrationyear'] (5 remaining, took 21.49s)
✅ Filled 1 values in 'fueltype' using ['brand', 'year_bin'] (4 remaining, took 2.49s)
✅ Filled 2 values in 'fueltype' using ['brand', 'registrationyear'] (2 remaining, took 3.75s)
🔁 Iteration 1 filled 6,248 total values

🌀 Iteration 2 starting...
  → Starting with 15,714 missing values in 'power'
✅ Filled 53 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (15,661 remaining, took 4.47s)
✅ Filled 37 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (15,624 remaining, took 10.68s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (15,611 remaining, took 4.09s)
✅ Filled 8 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (15,603 remaining, took 3.50s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (15,600 remaining, took 7.94s)
✅ Filled 171 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (15,429 remaining, took 3.26s)
✅ Filled 331 values in 'power' using ['brand', 'model', 'vehicletype'] (15,098 remaining, took 2.89s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (15,097 remaining, took 3.51s)
✅ Filled 19 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (15,078 remaining, took 3.28s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (15,076 remaining, took 7.16s)
✅ Filled 67 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (15,009 remaining, took 3.17s)
✅ Filled 1,095 values in 'power' using ['brand', 'model', 'year_bin'] (13,914 remaining, took 2.80s)
✅ Filled 28 values in 'power' using ['brand', 'model', 'registrationyear'] (13,886 remaining, took 5.31s)
✅ Filled 413 values in 'power' using ['brand', 'model'] (13,473 remaining, took 2.49s)
  → Starting with 686 missing values in 'vehicletype'
✅ Filled 22 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (664 remaining, took 11.00s)
✅ Filled 12 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (652 remaining, took 5.61s)
✅ Filled 197 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (455 remaining, took 2.89s)
✅ Filled 2 values in 'vehicletype' using ['brand', 'model'] (453 remaining, took 2.64s)
  → Starting with 3,151 missing values in 'model'
✅ Filled 20 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (3,131 remaining, took 11.05s)
✅ Filled 21 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (3,110 remaining, took 22.27s)
✅ Filled 64 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (3,046 remaining, took 9.52s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power'] (3,043 remaining, took 7.88s)
  → Starting with 2 missing values in 'fueltype'
🔁 Iteration 2 filled 2,582 total values

🌀 Iteration 3 starting...
  → Starting with 13,473 missing values in 'power'
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype'] (13,470 remaining, took 2.88s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (13,467 remaining, took 3.38s)
✅ Filled 64 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (13,403 remaining, took 3.07s)
✅ Filled 597 values in 'power' using ['brand', 'model', 'year_bin'] (12,806 remaining, took 2.77s)
  → Starting with 453 missing values in 'vehicletype'
✅ Filled 5 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (448 remaining, took 10.93s)
✅ Filled 6 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (442 remaining, took 21.49s)
✅ Filled 8 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (434 remaining, took 9.45s)
  → Starting with 3,043 missing values in 'model'
✅ Filled 5 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (3,038 remaining, took 11.10s)
✅ Filled 5 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (3,033 remaining, took 9.32s)
✅ Filled 2 values in 'model' using ['brand', 'power'] (3,031 remaining, took 4.92s)
  → Starting with 2 missing values in 'fueltype'
🔁 Iteration 3 filled 698 total values

🌀 Iteration 4 starting...
  → Starting with 12,806 missing values in 'power'
  → Starting with 434 missing values in 'vehicletype'
✅ Filled 42 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (392 remaining, took 2.85s)
✅ Filled 5 values in 'vehicletype' using ['brand', 'model'] (387 remaining, took 2.55s)
  → Starting with 3,031 missing values in 'model'
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (3,029 remaining, took 9.28s)
  → Starting with 2 missing values in 'fueltype'
🔁 Iteration 4 filled 49 total values

🌀 Iteration 5 starting...
  → Starting with 12,806 missing values in 'power'
  → Starting with 387 missing values in 'vehicletype'
  → Starting with 3,029 missing values in 'model'
  → Starting with 2 missing values in 'fueltype'
🔁 Iteration 5 filled 0 total values
🏁 No further changes detected, stopping.
In [170]:
del df_app1
gc.collect()
Out[170]:
0
In [171]:
mask = (df_app2['brand'] == 'volkswagen') & (df_app2['model'] == 'golf') & (df_app2['vehicletype'] == 'bus')
df_app2.loc[mask,['vehicletype']] = np.nan

mask = (df_app2['brand'] == 'volkswagen') & (df_app2['model'] == 'golf') & (df_app2['vehicletype'].isna())
df_app2.loc[mask,['vehicletype']] = 'other'
In [172]:
mask = (df_app2['brand'] == 'mercedes_benz') & (df_app2['model'] == 'a_klasse') & (df_app2['vehicletype'] == 'bus')
df_app2.loc[mask,['vehicletype']] = 'small'

mask = (df_app2['brand'] == 'smart') & (df_app2['model'] == 'fortwo') & (df_app2['vehicletype'] == 'bus')
df_app2.loc[mask,['vehicletype']] = 'small'

mask = (df_app2['brand'] == 'lada') & (df_app2['model'] == 'niva') & (df_app2['vehicletype'] == 'bus')
df_app2.loc[mask,['vehicletype']] = 'suv'
In [173]:
mask = (df_app2['brand'] == 'volkswagen') & (df_app2['model'] == 'transporter') & (df_app2['vehicletype'] == 'wagon')
df_app2.loc[mask,['vehicletype']] = 'bus'
In [174]:
del mask
gc.collect()
Out[174]:
0
In [175]:
print(df_app2.memory_usage(deep=True).sum() / 1_000_000, "MB")
260.514323 MB
In [176]:
df_app2 = fill_gearbox(df_app2, threshold = 0.6)
Filled 58 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥60% majority rule)
Filled 26 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥60% majority rule)

✅ Gearbox filling complete: 84 filled, 680 still missing.
In [177]:
df_ap4 = df_app2.drop_duplicates()
In [178]:
df_ap4[df_ap4['power'] == 0]
Out[178]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
24 15/03/2016 20:59 245 sedan 1994.0 manual 0.0 golf 150000 2 petrol volkswagen no 2016-03-15 0 44145 17/03/2016 18:17 N 1990s 4
40 17/03/2016 07:56 4700 wagon 2005.0 manual 0.0 signum 150000 0 gasoline opel no 2016-03-17 0 88433 04/04/2016 04:17 N 2000s 8
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
85 03/04/2016 03:57 350 small 1998.0 manual 0.0 corsa 150000 2 petrol opel NaN 2016-03-04 0 82110 03/04/2016 08:53 N 1990s 8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340761 10/03/2016 22:49 1200 small 2013.0 manual 0.0 i_reihe 50000 12 petrol hyundai NaN 2016-10-03 0 6493 14/03/2016 09:16 N 2010_plus 6
340769 11/03/2016 03:03 1500 coupe 2013.0 NaN 0.0 NaN 5000 0 petrol sonstige_autos NaN 2016-11-03 0 40476 06/04/2016 04:44 N 2010_plus 4
340770 09/03/2016 21:00 1150 small 2013.0 auto 0.0 fortwo 150000 11 petrol smart no 2016-09-03 0 47443 10/03/2016 07:46 N 2010_plus 4
340774 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8
340784 15/03/2016 11:45 850 other 2013.0 manual 0.0 other 5000 0 petrol audi no 2016-03-15 0 86647 16/03/2016 07:17 N 2010_plus 8

12806 rows × 19 columns

In [179]:
df_app4 = fill_zero_power(df_ap4,group_cols = ['brand','model','vehicletype','fueltype','year_bin','pc_bin'], threshold = 0.6)
display(df_app4[df_app4['power'] == 0])

df_app4 = fill_zero_power(df_app4,group_cols = ['brand','model','vehicletype','fueltype','registrationyear','pc_bin'], threshold = 0.6)
df_app4[df_app4['power'] == 0]

df_app4 = fill_zero_power(df_app4,group_cols = ['brand','model','vehicletype','fueltype','gearbox','pc_bin'], threshold = 0.6)
df_app4[df_app4['power'] == 0]


df_app4 = fill_zero_power(df_app4,group_cols = ['brand','model','vehicletype','fueltype','pc_bin'], threshold = 0.6)
df_app4[df_app4['power'] == 0]

df_app4 = fill_zero_power(df_app4,group_cols = ['brand','model','vehicletype','pc_bin'], threshold = 0.6)
display(df_app4[df_app4['power'] == 0])
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
24 15/03/2016 20:59 245 sedan 1994.0 manual 0.0 golf 150000 2 petrol volkswagen no 2016-03-15 0 44145 17/03/2016 18:17 N 1990s 4
40 17/03/2016 07:56 4700 wagon 2005.0 manual 0.0 signum 150000 0 gasoline opel no 2016-03-17 0 88433 04/04/2016 04:17 N 2000s 8
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
85 03/04/2016 03:57 350 small 1998.0 manual 0.0 corsa 150000 2 petrol opel NaN 2016-03-04 0 82110 03/04/2016 08:53 N 1990s 8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340740 10/03/2016 22:49 1200 small 2013.0 manual 0.0 i_reihe 50000 12 petrol hyundai NaN 2016-10-03 0 6493 14/03/2016 09:16 N 2010_plus 6
340748 11/03/2016 03:03 1500 coupe 2013.0 NaN 0.0 NaN 5000 0 petrol sonstige_autos NaN 2016-11-03 0 40476 06/04/2016 04:44 N 2010_plus 4
340749 09/03/2016 21:00 1150 small 2013.0 auto 0.0 fortwo 150000 11 petrol smart no 2016-09-03 0 47443 10/03/2016 07:46 N 2010_plus 4
340753 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8
340763 15/03/2016 11:45 850 other 2013.0 manual 0.0 other 5000 0 petrol audi no 2016-03-15 0 86647 16/03/2016 07:17 N 2010_plus 8

12735 rows × 19 columns

datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
24 15/03/2016 20:59 245 sedan 1994.0 manual 0.0 golf 150000 2 petrol volkswagen no 2016-03-15 0 44145 17/03/2016 18:17 N 1990s 4
40 17/03/2016 07:56 4700 wagon 2005.0 manual 0.0 signum 150000 0 gasoline opel no 2016-03-17 0 88433 04/04/2016 04:17 N 2000s 8
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
85 03/04/2016 03:57 350 small 1998.0 manual 0.0 corsa 150000 2 petrol opel NaN 2016-03-04 0 82110 03/04/2016 08:53 N 1990s 8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340740 10/03/2016 22:49 1200 small 2013.0 manual 0.0 i_reihe 50000 12 petrol hyundai NaN 2016-10-03 0 6493 14/03/2016 09:16 N 2010_plus 6
340748 11/03/2016 03:03 1500 coupe 2013.0 NaN 0.0 NaN 5000 0 petrol sonstige_autos NaN 2016-11-03 0 40476 06/04/2016 04:44 N 2010_plus 4
340749 09/03/2016 21:00 1150 small 2013.0 auto 0.0 fortwo 150000 11 petrol smart no 2016-09-03 0 47443 10/03/2016 07:46 N 2010_plus 4
340753 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8
340763 15/03/2016 11:45 850 other 2013.0 manual 0.0 other 5000 0 petrol audi no 2016-03-15 0 86647 16/03/2016 07:17 N 2010_plus 8

12516 rows × 19 columns

In [180]:
df_app5 = fill_missing_models_majority_x(df_app4, threshold = 0.6)
✅ Filled 5 missing models (threshold=60%)
In [181]:
df_app6 = fill_missing_models_majority(df_app5, threshold = 0.6)
In [182]:
df_app7 = df_app6[df_app6['vehicletype'].notna()]
In [183]:
df_app8 = df_app7[df_app7['gearbox'].notna()]
In [184]:
df_app9 = fill_zero_power(df_app8,group_cols = ['brand','model','vehicletype','fueltype','year_bin','pc_bin'], threshold = 0.55)
display(df_app9[df_app9['power'] == 0])

df_app9 = fill_zero_power(df_app9,group_cols = ['brand','model','vehicletype','fueltype','registrationyear','pc_bin'], threshold = 0.55)
df_app9[df_app9['power'] == 0]


df_app9 = fill_zero_power(df_app9,group_cols = ['brand','model','vehicletype','fueltype','gearbox','pc_bin'], threshold = 0.55)
df_app9[df_app9['power'] == 0]

df_app9 = fill_zero_power(df_app9,group_cols = ['brand','model','vehicletype','fueltype','pc_bin'], threshold = 0.55)
df_app9[df_app9['power'] == 0]


df_app9 = fill_zero_power(df_app9,group_cols = ['brand','model','vehicletype','pc_bin'], threshold = 0.55)
df_app9[df_app9['power'] == 0]
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
24 15/03/2016 20:59 245 sedan 1994.0 manual 0.0 golf 150000 2 petrol volkswagen no 2016-03-15 0 44145 17/03/2016 18:17 N 1990s 4
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
85 03/04/2016 03:57 350 small 1998.0 manual 0.0 corsa 150000 2 petrol opel NaN 2016-03-04 0 82110 03/04/2016 08:53 N 1990s 8
122 01/04/2016 16:06 800 sedan 1993.0 manual 0.0 golf 10000 9 petrol volkswagen yes 2016-01-04 0 65929 07/04/2016 11:17 N 1990s 6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339969 01/04/2016 12:38 1299 coupe 2013.0 auto 0.0 NaN 5000 0 petrol sonstige_autos no 2016-01-04 0 48703 01/04/2016 12:38 N 2010_plus 4
339973 05/04/2016 02:36 1500 coupe 2013.0 manual 0.0 NaN 5000 11 petrol sonstige_autos no 2016-05-04 0 27474 05/04/2016 08:46 N 2010_plus 2
339974 09/03/2016 12:58 700 coupe 2013.0 manual 0.0 NaN 100000 1 petrol sonstige_autos no 2016-09-03 0 51570 05/04/2016 01:16 N 2010_plus 5
339976 20/03/2016 23:49 3000 coupe 2013.0 manual 0.0 NaN 150000 0 gasoline sonstige_autos NaN 2016-03-20 0 85072 23/03/2016 11:17 N 2010_plus 8
339978 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8

11827 rows × 19 columns

Out[184]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
24 15/03/2016 20:59 245 sedan 1994.0 manual 0.0 golf 150000 2 petrol volkswagen no 2016-03-15 0 44145 17/03/2016 18:17 N 1990s 4
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
85 03/04/2016 03:57 350 small 1998.0 manual 0.0 corsa 150000 2 petrol opel NaN 2016-03-04 0 82110 03/04/2016 08:53 N 1990s 8
122 01/04/2016 16:06 800 sedan 1993.0 manual 0.0 golf 10000 9 petrol volkswagen yes 2016-01-04 0 65929 07/04/2016 11:17 N 1990s 6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339969 01/04/2016 12:38 1299 coupe 2013.0 auto 0.0 NaN 5000 0 petrol sonstige_autos no 2016-01-04 0 48703 01/04/2016 12:38 N 2010_plus 4
339973 05/04/2016 02:36 1500 coupe 2013.0 manual 0.0 NaN 5000 11 petrol sonstige_autos no 2016-05-04 0 27474 05/04/2016 08:46 N 2010_plus 2
339974 09/03/2016 12:58 700 coupe 2013.0 manual 0.0 NaN 100000 1 petrol sonstige_autos no 2016-09-03 0 51570 05/04/2016 01:16 N 2010_plus 5
339976 20/03/2016 23:49 3000 coupe 2013.0 manual 0.0 NaN 150000 0 gasoline sonstige_autos NaN 2016-03-20 0 85072 23/03/2016 11:17 N 2010_plus 8
339978 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8

10919 rows × 19 columns

In [185]:
def fill_all_missing_values_mp(df, repeat_until_change=True, threshold=0.7, verbose=True, max_iterations=10):
    """
    Fill missing values for power and model using tiered group strategies.
    Optimized version with better memory management and early stopping.
    """
    df = df.copy()

    def safe_mode(series):
        """Return mode if confident enough (>= threshold), else NaN."""
        s = series.dropna()
        if len(s) == 0:
            return np.nan
        counts = s.value_counts(normalize=True)
        if len(counts) == 0:
            return np.nan
        top_val, top_freq = counts.index[0], counts.iloc[0]
        return top_val if top_freq >= threshold else np.nan

    def is_zero_condition(condition):
        """Heuristic to detect if condition checks for zero (lambda x: x == 0)."""
        try:
            test = condition(pd.Series([0, np.nan], dtype=object))
            if isinstance(test, (bool, np.bool_)) and test:
                return True
            if hasattr(test, "__len__") and len(test) >= 1:
                return bool(test.iloc[0] if hasattr(test, 'iloc') else test[0])
        except Exception:
            pass
        return False

    def make_key_tuple(row_vals):
        """Helper: convert list-like row values to a hashable tuple with None for NaN."""
        return tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row_vals)

    def fill_column(target_col, fill_strategies, condition=lambda x: x.isna()):
        total_filled = 0
        zero_check = is_zero_condition(condition)

        # Track initial state
        if zero_check:
            initial_missing = (df[target_col] == 0).sum()
        else:
            initial_missing = df[target_col].isna().sum()
        
        if initial_missing == 0:
            return 0
        
        if verbose:
            print(f"  → Starting with {initial_missing:,} missing values in '{target_col}'")

        for cols in fill_strategies:
            # Check if there's still work to do
            if zero_check:
                current_missing = (df[target_col] == 0).sum()
            else:
                current_missing = df[target_col].isna().sum()
            
            if current_missing == 0:
                break
            
            start_time = time.time()

            try:
                # Compute group modes using safe_mode
                group_modes = (
                    df.groupby(cols, dropna=False)[target_col]
                    .apply(safe_mode)
                    .reset_index()
                    .rename(columns={target_col: 'fill_value'})
                )
                
                # Remove groups with no valid fill value
                group_modes = group_modes[group_modes['fill_value'].notna()]
                
                if len(group_modes) == 0:
                    continue

            except Exception as e:
                if verbose:
                    print(f"⚠️ Skipping strategy {cols} — groupby failed: {str(e)[:50]}")
                continue

            # Build mapping dict from group_modes
            keys = group_modes[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            mapping = dict(zip(keys, group_modes['fill_value'].values))

            # Compute fill_value per-row by mapping (keeps original row order)
            row_keys = df[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            fill_series = row_keys.map(mapping)

            # Create mask of rows that need filling AND have a candidate fill_value
            mask_need = condition(df[target_col])
            mask_candidate = fill_series.notna()
            mask = mask_need & mask_candidate

            # Count before
            if zero_check:
                before_missing = (df[target_col] == 0).sum()
            else:
                before_missing = df[target_col].isna().sum()

            # Perform fill
            if mask.any():
                df.loc[mask, target_col] = fill_series.loc[mask].values

            # Count after
            if zero_check:
                after_missing = (df[target_col] == 0).sum()
            else:
                after_missing = df[target_col].isna().sum()

            filled_now = before_missing - after_missing
            total_filled += int(filled_now)

            if verbose and filled_now > 0:
                elapsed = time.time() - start_time
                print(f"✅ Filled {int(filled_now):,} values in '{target_col}' using {cols} ({after_missing:,} remaining, took {elapsed:.2f}s)")

        return total_filled

    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        total_filled = 0
        if verbose:
            print(f"\n🌀 Iteration {iteration} starting...")

        # --- POWER ---
        power_strategies = [
            ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model'],
            ['brand', 'vehicletype'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('power', power_strategies, condition=lambda x: x == 0)

        # --- MODEL ---
        model_strategies = [
            ['brand', 'vehicletype', 'power', 'year_bin'],
            ['brand', 'vehicletype', 'power', 'registrationyear'],
            ['brand', 'vehicletype', 'power', 'gearbox'],
            ['brand', 'vehicletype', 'year_bin'],
            ['brand', 'vehicletype', 'registrationyear'],
            ['brand', 'vehicletype', 'power'],
            ['brand', 'vehicletype', 'gearbox'],
            ['brand', 'vehicletype'],
            ['brand', 'power'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('model', model_strategies)

        if verbose:
            print(f"🔁 Iteration {iteration} filled {total_filled:,} total values")

        if not repeat_until_change or total_filled == 0:
            if verbose:
                print("🏁 No further changes detected, stopping.")
            break

    return df
In [186]:
df_app10 = fill_all_missing_values_mp(df_app9, threshold = 0.55)
🌀 Iteration 1 starting...
  → Starting with 10,919 missing values in 'power'
✅ Filled 56 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (10,863 remaining, took 4.43s)
✅ Filled 253 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (10,610 remaining, took 10.56s)
✅ Filled 55 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (10,555 remaining, took 3.89s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (10,552 remaining, took 3.47s)
✅ Filled 15 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (10,537 remaining, took 7.86s)
✅ Filled 16 values in 'power' using ['brand', 'model', 'vehicletype'] (10,521 remaining, took 2.83s)
✅ Filled 31 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (10,490 remaining, took 3.44s)
✅ Filled 134 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (10,356 remaining, took 7.12s)
✅ Filled 8 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (10,348 remaining, took 3.16s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'registrationyear'] (10,345 remaining, took 5.16s)
✅ Filled 2 values in 'power' using ['brand', 'model'] (10,343 remaining, took 2.39s)
✅ Filled 2 values in 'power' using ['brand', 'registrationyear'] (10,341 remaining, took 3.43s)
  → Starting with 2,395 missing values in 'model'
✅ Filled 57 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (2,338 remaining, took 11.02s)
✅ Filled 45 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (2,293 remaining, took 22.31s)
✅ Filled 25 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (2,268 remaining, took 9.40s)
✅ Filled 11 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (2,257 remaining, took 2.83s)
✅ Filled 34 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (2,223 remaining, took 5.75s)
✅ Filled 5 values in 'model' using ['brand', 'vehicletype', 'power'] (2,218 remaining, took 7.88s)
✅ Filled 41 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (2,177 remaining, took 2.72s)
✅ Filled 2 values in 'model' using ['brand', 'registrationyear'] (2,175 remaining, took 3.62s)
✅ Filled 1 values in 'model' using ['brand'] (2,174 remaining, took 1.94s)
🔁 Iteration 1 filled 799 total values

🌀 Iteration 2 starting...
  → Starting with 10,341 missing values in 'power'
✅ Filled 3 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (10,338 remaining, took 4.46s)
✅ Filled 21 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (10,317 remaining, took 3.44s)
✅ Filled 4 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (10,313 remaining, took 7.82s)
✅ Filled 257 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (10,056 remaining, took 3.15s)
✅ Filled 187 values in 'power' using ['brand', 'model', 'vehicletype'] (9,869 remaining, took 2.70s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (9,868 remaining, took 3.32s)
✅ Filled 1,445 values in 'power' using ['brand', 'model', 'year_bin'] (8,423 remaining, took 2.68s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'registrationyear'] (8,421 remaining, took 5.15s)
✅ Filled 293 values in 'power' using ['brand', 'model'] (8,128 remaining, took 2.43s)
  → Starting with 2,174 missing values in 'model'
✅ Filled 4 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (2,170 remaining, took 10.96s)
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (2,168 remaining, took 22.17s)
🔁 Iteration 2 filled 2,219 total values

🌀 Iteration 3 starting...
  → Starting with 8,128 missing values in 'power'
✅ Filled 63 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (8,065 remaining, took 3.46s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (8,062 remaining, took 3.35s)
✅ Filled 40 values in 'power' using ['brand', 'model', 'vehicletype'] (8,022 remaining, took 2.84s)
  → Starting with 2,168 missing values in 'model'
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (2,166 remaining, took 11.15s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (2,165 remaining, took 22.29s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (2,162 remaining, took 9.41s)
🔁 Iteration 3 filled 112 total values

🌀 Iteration 4 starting...
  → Starting with 8,022 missing values in 'power'
  → Starting with 2,162 missing values in 'model'
🔁 Iteration 4 filled 0 total values
🏁 No further changes detected, stopping.
In [187]:
gc.collect()
Out[187]:
0
In [188]:
df_app11 = fill_zero_power(df_app10,group_cols = ['brand','model','vehicletype','fueltype','year_bin','pc_bin'], threshold = 0.55)
display(df_app11[df_app11['power'] == 0])

df_app11 = fill_zero_power(df_app11,group_cols = ['brand','model','vehicletype','fueltype','registrationyear','pc_bin'], threshold = 0.55)
df_app11[df_app11['power'] == 0]


df_app11 = fill_zero_power(df_app11,group_cols = ['brand','model','vehicletype','fueltype','gearbox','pc_bin'], threshold = 0.55)
df_app11[df_app11['power'] == 0]

df_app11 = fill_zero_power(df_app11,group_cols = ['brand','model','vehicletype','fueltype','pc_bin'], threshold = 0.55)
df_app11[df_app11['power'] == 0]


df_app11 = fill_zero_power(df_app11,group_cols = ['brand','model','vehicletype','pc_bin'], threshold = 0.55)
df_app11[df_app11['power'] == 0]
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
85 03/04/2016 03:57 350 small 1998.0 manual 0.0 corsa 150000 2 petrol opel NaN 2016-03-04 0 82110 03/04/2016 08:53 N 1990s 8
122 01/04/2016 16:06 800 sedan 1993.0 manual 0.0 golf 10000 9 petrol volkswagen yes 2016-01-04 0 65929 07/04/2016 11:17 N 1990s 6
141 12/03/2016 17:47 2999 wagon 2001.0 manual 0.0 3er 150000 7 petrol bmw NaN 2016-12-03 0 45891 07/04/2016 09:17 N 2000s 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339969 01/04/2016 12:38 1299 coupe 2013.0 auto 0.0 NaN 5000 0 petrol sonstige_autos no 2016-01-04 0 48703 01/04/2016 12:38 N 2010_plus 4
339973 05/04/2016 02:36 1500 coupe 2013.0 manual 0.0 NaN 5000 11 petrol sonstige_autos no 2016-05-04 0 27474 05/04/2016 08:46 N 2010_plus 2
339974 09/03/2016 12:58 700 coupe 2013.0 manual 0.0 NaN 100000 1 petrol sonstige_autos no 2016-09-03 0 51570 05/04/2016 01:16 N 2010_plus 5
339976 20/03/2016 23:49 3000 coupe 2013.0 manual 0.0 NaN 150000 0 gasoline sonstige_autos NaN 2016-03-20 0 85072 23/03/2016 11:17 N 2010_plus 8
339978 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8

7949 rows × 19 columns

Out[188]:
datecrawled price vehicletype registrationyear gearbox power model mileage registrationmonth fueltype brand notrepaired datecreated numberofpictures postalcode lastseen registration_correction year_bin pc_bin
52 01/04/2016 11:56 1200 coupe 2001.0 manual 0.0 astra 150000 0 petrol opel NaN 2016-01-04 0 47249 07/04/2016 08:46 N 2000s 4
68 23/03/2016 11:53 2400 sedan 2003.0 manual 0.0 a4 150000 9 gasoline audi NaN 2016-03-23 0 40210 23/03/2016 11:53 N 2000s 4
85 03/04/2016 03:57 350 small 1998.0 manual 0.0 corsa 150000 2 petrol opel NaN 2016-03-04 0 82110 03/04/2016 08:53 N 1990s 8
122 01/04/2016 16:06 800 sedan 1993.0 manual 0.0 golf 10000 9 petrol volkswagen yes 2016-01-04 0 65929 07/04/2016 11:17 N 1990s 6
141 12/03/2016 17:47 2999 wagon 2001.0 manual 0.0 3er 150000 7 petrol bmw NaN 2016-12-03 0 45891 07/04/2016 09:17 N 2000s 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339969 01/04/2016 12:38 1299 coupe 2013.0 auto 0.0 NaN 5000 0 petrol sonstige_autos no 2016-01-04 0 48703 01/04/2016 12:38 N 2010_plus 4
339973 05/04/2016 02:36 1500 coupe 2013.0 manual 0.0 NaN 5000 11 petrol sonstige_autos no 2016-05-04 0 27474 05/04/2016 08:46 N 2010_plus 2
339974 09/03/2016 12:58 700 coupe 2013.0 manual 0.0 NaN 100000 1 petrol sonstige_autos no 2016-09-03 0 51570 05/04/2016 01:16 N 2010_plus 5
339976 20/03/2016 23:49 3000 coupe 2013.0 manual 0.0 NaN 150000 0 gasoline sonstige_autos NaN 2016-03-20 0 85072 23/03/2016 11:17 N 2010_plus 8
339978 14/03/2016 19:40 5999 coupe 2013.0 manual 0.0 NaN 150000 12 gasoline sonstige_autos NaN 2016-03-14 0 89081 19/03/2016 07:47 N 2010_plus 8

7929 rows × 19 columns

In [189]:
df_app12 = df_app11[df_app11['model'].notna()]
In [190]:
def fill_missing_power(df, repeat_until_change=True, threshold=0.7, verbose=True, max_iterations=10):
    """
    Fill missing power values (where power == 0) using tiered group strategies.
    Optimized version with better memory management and early stopping.
    """
    df = df.copy()

    def safe_mode(series):
        """Return mode if confident enough (>= threshold), else NaN."""
        s = series.dropna()
        if len(s) == 0:
            return np.nan
        counts = s.value_counts(normalize=True)
        if len(counts) == 0:
            return np.nan
        top_val, top_freq = counts.index[0], counts.iloc[0]
        return top_val if top_freq >= threshold else np.nan

    def is_zero_condition(condition):
        """Heuristic to detect if condition checks for zero (lambda x: x == 0)."""
        try:
            test = condition(pd.Series([0, np.nan], dtype=object))
            if isinstance(test, (bool, np.bool_)) and test:
                return True
            if hasattr(test, "__len__") and len(test) >= 1:
                return bool(test.iloc[0] if hasattr(test, 'iloc') else test[0])
        except Exception:
            pass
        return False

    def make_key_tuple(row_vals):
        """Helper: convert list-like row values to a hashable tuple with None for NaN."""
        return tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row_vals)

    def fill_column(target_col, fill_strategies, condition=lambda x: x.isna()):
        total_filled = 0
        zero_check = is_zero_condition(condition)

        # Track initial state
        if zero_check:
            initial_missing = (df[target_col] == 0).sum()
        else:
            initial_missing = df[target_col].isna().sum()
        
        if initial_missing == 0:
            return 0
        
        if verbose:
            print(f"  → Starting with {initial_missing:,} missing values in '{target_col}'")

        for cols in fill_strategies:
            # Check if there's still work to do
            if zero_check:
                current_missing = (df[target_col] == 0).sum()
            else:
                current_missing = df[target_col].isna().sum()
            
            if current_missing == 0:
                break
            
            start_time = time.time()

            try:
                # Compute group modes using safe_mode
                group_modes = (
                    df.groupby(cols, dropna=False)[target_col]
                    .apply(safe_mode)
                    .reset_index()
                    .rename(columns={target_col: 'fill_value'})
                )
                
                # Remove groups with no valid fill value
                group_modes = group_modes[group_modes['fill_value'].notna()]
                
                if len(group_modes) == 0:
                    continue

            except Exception as e:
                if verbose:
                    print(f"⚠️ Skipping strategy {cols} — groupby failed: {str(e)[:50]}")
                continue

            # Build mapping dict from group_modes
            keys = group_modes[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            mapping = dict(zip(keys, group_modes['fill_value'].values))

            # Compute fill_value per-row by mapping (keeps original row order)
            row_keys = df[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            fill_series = row_keys.map(mapping)

            # Create mask of rows that need filling AND have a candidate fill_value
            mask_need = condition(df[target_col])
            mask_candidate = fill_series.notna()
            mask = mask_need & mask_candidate

            # Count before
            if zero_check:
                before_missing = (df[target_col] == 0).sum()
            else:
                before_missing = df[target_col].isna().sum()

            # Perform fill
            if mask.any():
                df.loc[mask, target_col] = fill_series.loc[mask].values

            # Count after
            if zero_check:
                after_missing = (df[target_col] == 0).sum()
            else:
                after_missing = df[target_col].isna().sum()

            filled_now = before_missing - after_missing
            total_filled += int(filled_now)

            if verbose and filled_now > 0:
                elapsed = time.time() - start_time
                print(f"✅ Filled {int(filled_now):,} values in '{target_col}' using {cols} ({after_missing:,} remaining, took {elapsed:.2f}s)")

        return total_filled

    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        total_filled = 0
        if verbose:
            print(f"\n🌀 Iteration {iteration} starting...")

        # --- POWER ONLY ---
        power_strategies = [
            ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'pc_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model', 'pc_bin'],
            ['brand', 'model'],
            ['brand', 'vehicletype'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand', 'pc_bin'],
            ['brand']
        ]
        total_filled += fill_column('power', power_strategies, condition=lambda x: x == 0)

        if verbose:
            print(f"🔁 Iteration {iteration} filled {total_filled:,} total values")

        if not repeat_until_change or total_filled == 0:
            if verbose:
                print("🏁 No further changes detected, stopping.")
            break

    return df
In [191]:
df_app13 = fill_missing_power(df_app12, threshold = 0.51)
🌀 Iteration 1 starting...
  → Starting with 7,635 missing values in 'power'
✅ Filled 10 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (7,625 remaining, took 11.12s)
✅ Filled 25 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (7,600 remaining, took 29.80s)
✅ Filled 5 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (7,595 remaining, took 9.28s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'] (7,594 remaining, took 7.19s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (7,592 remaining, took 4.97s)
✅ Filled 48 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (7,544 remaining, took 4.30s)
✅ Filled 178 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (7,366 remaining, took 10.23s)
✅ Filled 7 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (7,359 remaining, took 3.96s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (7,346 remaining, took 7.59s)
✅ Filled 4 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (7,342 remaining, took 3.17s)
✅ Filled 5 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (7,337 remaining, took 3.34s)
✅ Filled 37 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (7,300 remaining, took 3.32s)
✅ Filled 60 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (7,240 remaining, took 6.95s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (7,239 remaining, took 3.12s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'registrationyear'] (7,237 remaining, took 5.11s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'pc_bin'] (7,224 remaining, took 3.33s)
✅ Filled 1 values in 'power' using ['brand', 'model'] (7,223 remaining, took 2.44s)
✅ Filled 1 values in 'power' using ['brand', 'vehicletype'] (7,222 remaining, took 2.41s)
✅ Filled 5 values in 'power' using ['brand', 'year_bin'] (7,217 remaining, took 2.39s)
🔁 Iteration 1 filled 418 total values

🌀 Iteration 2 starting...
  → Starting with 7,217 missing values in 'power'
🔁 Iteration 2 filled 0 total values
🏁 No further changes detected, stopping.
In [192]:
print(df_app13.memory_usage(deep=True).sum() / 1_000_000, "MB")
258.348599 MB
In [193]:
gc.collect()
Out[193]:
0
In [194]:
del df_ap4
del df_app4
del df_app5
del df_app6
del df_app7
del df_app8
del df_app9
del df_app10
del df_app11
del df_app12
In [195]:
df_app14 = df_app13[df_app13['power'] != 0]
In [196]:
df_app15 = df_app14.drop(columns = ['year_bin','pc_bin', 'registration_correction'])
In [197]:
df_app15['notrepaired'] = df_app15['notrepaired'].fillna('unknown')
In [198]:
df_app15.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 330602 entries, 0 to 339980
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   datecrawled        330602 non-null  object        
 1   price              330602 non-null  int64         
 2   vehicletype        330602 non-null  object        
 3   registrationyear   330602 non-null  float64       
 4   gearbox            330602 non-null  object        
 5   power              330602 non-null  float64       
 6   model              330602 non-null  object        
 7   mileage            330602 non-null  int64         
 8   registrationmonth  330602 non-null  int64         
 9   fueltype           330600 non-null  object        
 10  brand              330602 non-null  object        
 11  notrepaired        330602 non-null  object        
 12  datecreated        330602 non-null  datetime64[ns]
 13  numberofpictures   330602 non-null  int64         
 14  postalcode         330602 non-null  int64         
 15  lastseen           330602 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(5), object(8)
memory usage: 42.9+ MB
In [199]:
df_app15.to_pickle('checkpoint_02.pkl')
In [200]:
petrol = (df_app15['fueltype'] == 'gasoline')
df_app15.loc[petrol,['fueltype']] = 'petrol'

petrol = (df_app15['fueltype'].isna())
df_app15.loc[petrol,['fueltype']] = 'petrol'

del petrol

DType Clean Up¶

In [201]:
# 1. Fix datetime columns
date_cols = ['datecrawled', 'lastseen']
for col in date_cols:
    df_app15[col] = pd.to_datetime(df_app15[col], errors='coerce')
In [202]:
# 2. Convert numeric columns to efficient types
# registrationyear & power should not be floats
df_app15['registrationyear'] = df_app15['registrationyear'].astype('int')
df_app15['power'] = df_app15['power'].astype('int')
In [203]:
# 3. Clean up memory
gc.collect()

print("Final memory usage:", df_app15.memory_usage(deep=True).sum() / 1_000_000, "MB")
print(df_app15.dtypes)
Final memory usage: 152.521751 MB
datecrawled          datetime64[ns]
price                         int64
vehicletype                  object
registrationyear              int64
gearbox                      object
power                         int64
model                        object
mileage                       int64
registrationmonth             int64
fueltype                     object
brand                        object
notrepaired                  object
datecreated          datetime64[ns]
numberofpictures              int64
postalcode                    int64
lastseen             datetime64[ns]
dtype: object
In [204]:
df_app15['datecrawled_year'] = df_app15['datecrawled'].dt.year
df_app15['datecrawled_month'] = df_app15['datecrawled'].dt.month.astype('object')

df_app15['datecreated_year'] = df_app15['datecreated'].dt.year
df_app15['datecreated_month'] = df_app15['datecreated'].dt.month.astype('object')

df_app15['lastseen_year'] = df_app15['lastseen'].dt.year
df_app15['lastseen_month'] = df_app15['lastseen'].dt.month.astype('object')

df_app15['postalcode'] = df_app15['postalcode'].astype('object')
df_app15['registrationmonth'] = df_app15['registrationmonth'].astype('object')
In [205]:
df_app15.insert(df_app15.columns.get_loc("datecrawled"), "datecrawled_month", df_app15.pop("datecrawled_month"))
df_app15.insert(df_app15.columns.get_loc("datecrawled") + 1, "datecrawled_year", df_app15.pop("datecrawled_year"))

df_app15.insert(df_app15.columns.get_loc("datecreated"), "datecreated_month", df_app15.pop("datecreated_month"))
df_app15.insert(df_app15.columns.get_loc("datecreated") + 1, "datecreated_year", df_app15.pop("datecreated_year"))

df_app15.insert(df_app15.columns.get_loc("lastseen"), "lastseen_month", df_app15.pop("lastseen_month"))
df_app15.insert(df_app15.columns.get_loc("lastseen") + 1, "lastseen_year", df_app15.pop("lastseen_year"))
In [206]:
df_app15 = df_app15.drop(columns=['datecrawled', 'datecreated', 'lastseen'])
In [207]:
df_app15.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 330602 entries, 0 to 339980
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   datecrawled_month  330602 non-null  object
 1   datecrawled_year   330602 non-null  int64 
 2   price              330602 non-null  int64 
 3   vehicletype        330602 non-null  object
 4   registrationyear   330602 non-null  int64 
 5   gearbox            330602 non-null  object
 6   power              330602 non-null  int64 
 7   model              330602 non-null  object
 8   mileage            330602 non-null  int64 
 9   registrationmonth  330602 non-null  object
 10  fueltype           330602 non-null  object
 11  brand              330602 non-null  object
 12  notrepaired        330602 non-null  object
 13  datecreated_month  330602 non-null  object
 14  datecreated_year   330602 non-null  int64 
 15  numberofpictures   330602 non-null  int64 
 16  postalcode         330602 non-null  object
 17  lastseen_month     330602 non-null  object
 18  lastseen_year      330602 non-null  int64 
dtypes: int64(8), object(11)
memory usage: 50.4+ MB

DataFrame Comparison¶

In [208]:
gc.collect()
Out[208]:
43
In [209]:
coupe = df[df['vehicletype'] == 'coupe']
suv = df[df['vehicletype'] == 'suv']
small = df[df['vehicletype'] == 'small']
sedan = df[df['vehicletype'] == 'sedan']
convertible = df[df['vehicletype'] == 'convertible']
bus = df[df['vehicletype'] == 'bus']
wagon = df[df['vehicletype'] == 'wagon']
In [210]:
ncoupe = df_app15[df_app15['vehicletype'] == 'coupe']
nsuv = df_app15[df_app15['vehicletype'] == 'suv']
nsmall = df_app15[df_app15['vehicletype'] == 'small']
nsedan = df_app15[df_app15['vehicletype'] == 'sedan']
nconvertible = df_app15[df_app15['vehicletype'] == 'convertible']
nbus = df_app15[df_app15['vehicletype'] == 'bus']
nwagon = df_app15[df_app15['vehicletype'] == 'wagon']
In [211]:
coupe['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Coupes per Brand: Before Data Cleaning')
plt.show()

ncoupe['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Coupes per Brand: After Data Cleaning')
plt.show()

plt.figure(figsize=(14,16))
sns.boxplot(data=coupe, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Coupe Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()

plt.figure(figsize=(14,16))
sns.boxplot(data=ncoupe, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Coupe Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [212]:
del coupe
del ncoupe
In [213]:
suv['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of SUVs per Brand: Before Data Cleaning')
plt.show()

nsuv['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of SUVs per Brand: After Data Cleaning')
plt.show()

plt.figure(figsize=(14,16))
sns.boxplot(data=suv, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of SUVs Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()

plt.figure(figsize=(14,16))
sns.boxplot(data=nsuv, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of SUVs Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [214]:
del suv
del nsuv
In [215]:
small['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Small per Brand: Before Data Cleaning')
plt.show()

nsmall['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Small per Brand: After Data Cleaning')
plt.show()

plt.figure(figsize=(14,16))
sns.boxplot(data=small, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Small Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()


plt.figure(figsize=(14,16))
sns.boxplot(data=nsmall, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Small Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [216]:
del small
del nsmall
In [217]:
sedan['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Sedan per Brand: Before Data Cleaning')
plt.show()

nsedan['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Sedan per Brand: After Data Cleaning')
plt.show()

plt.figure(figsize=(14,16))
sns.boxplot(data=sedan, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Sedan Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()


plt.figure(figsize=(14,16))
sns.boxplot(data=nsedan, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Sedan Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [218]:
del sedan
del nsedan
In [219]:
convertible['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Convertibles per Brand: Before Data Cleaning')
plt.show()

nconvertible['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Convertibless per Brand: After Data Cleaning')
plt.show()

plt.figure(figsize=(14,16))
sns.boxplot(data=convertible, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Convertibles Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()



plt.figure(figsize=(14,16))
sns.boxplot(data=nconvertible, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Convertible Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [220]:
del convertible
del nconvertible
In [221]:
bus['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Buses per Brand: Before Data Cleaning')
plt.show()

nbus['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Buses per Brand: After Data Cleaning')
plt.show()

plt.figure(figsize=(14,16))
sns.boxplot(data=bus, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Bus Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()


plt.figure(figsize=(14,16))
sns.boxplot(data=nbus, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Bus Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [222]:
del bus
del nbus
In [223]:
wagon['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Wagons per Brand: Before Data Cleaning')
plt.show()

nwagon['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Wagons per Brand: After Data Cleaning')
plt.show()

plt.figure(figsize=(14,16))
sns.boxplot(data=wagon, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Wagon Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()


plt.figure(figsize=(14,16))
sns.boxplot(data=nwagon, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Wagon Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [224]:
df['price'].hist(bins=20)
plt.show()

df_app15['price'].hist(bins=20)
plt.show()
No description has been provided for this image
No description has been provided for this image
In [225]:
del df_app2
del df_app13
del df_app14
del df_car
del df_app3
del df_vetype
del df_reg
del df_model_x
del df_vt
del df_model
del df_app
del df_ft
del wagon
del nwagon
del vt_power
del remainder_models
del ft
del bora_to_jetta
del jetta16
del captiva
del matiz68
del matiz52
del matiz67
del re_1
del passat
del passat1
del passat2
del passat3
del passat4
del passat5
del passat6
del golf
del passat140
del golf90
del passat90
del golf75
del golf7502
del passat105
del passat131
del passat116
del passat150
del passat115
del passat170
del golf110
del golf60
del polo60
del passat125
del passat100
del passat174
del passat130
del passat120
del audi75
del bmw75
del opelsedan60
del opel9160
del opelastra
del opelcorsa
del astraopel
del opelcombo
gc.collect()
Out[225]:
66895
In [ ]:
del civic75
del mini75
del nissan60
del seat60

Model Training¶

In [226]:
import sys

def show_memory_usage():
    vars_list = []
    for name, obj in globals().items():
        if not name.startswith('_'):
            size_mb = sys.getsizeof(obj) / (1024**2)
            if size_mb > 1:  # Only show objects > 1MB
                vars_list.append((name, size_mb, type(obj).__name__))
    
    vars_list.sort(key=lambda x: x[1], reverse=True)
    print("\n🔍 Memory Usage:")
    for name, size, dtype in vars_list[:10]:
        print(f"  {name}: {size:.2f} MB ({dtype})")

# Use it throughout your notebook
show_memory_usage()
🔍 Memory Usage:
  df: 222.44 MB (DataFrame)
  df1: 213.99 MB (DataFrame)
  df_newest: 211.90 MB (DataFrame)
  df_newer: 211.77 MB (DataFrame)
  df_new: 211.75 MB (DataFrame)
  df_app15: 197.05 MB (DataFrame)
  civic75: 11.10 MB (Series)
  mini75: 11.10 MB (Series)
  nissan60: 11.10 MB (Series)
  seat60: 11.10 MB (Series)
In [227]:
data = df_app15.copy()
In [228]:
del df_app15
gc.collect()
Out[228]:
0
In [247]:
# If the kernel crashes:
# import libraries (Go to the top - press ctrl+F and type libraries to get there faster - run the libraries)
# data = pd.read_pickle('checkpoint_03.pkl') <-- copy this on a new line right below, run it
# This is a checkpoint to start off with the data DF
data.to_pickle('checkpoint_03.pkl')
In [230]:
data = pd.read_pickle('checkpoint_03.pkl')
In [231]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 330602 entries, 0 to 339980
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   datecrawled_month  330602 non-null  object
 1   datecrawled_year   330602 non-null  int64 
 2   price              330602 non-null  int64 
 3   vehicletype        330602 non-null  object
 4   registrationyear   330602 non-null  int64 
 5   gearbox            330602 non-null  object
 6   power              330602 non-null  int64 
 7   model              330602 non-null  object
 8   mileage            330602 non-null  int64 
 9   registrationmonth  330602 non-null  object
 10  fueltype           330602 non-null  object
 11  brand              330602 non-null  object
 12  notrepaired        330602 non-null  object
 13  datecreated_month  330602 non-null  object
 14  datecreated_year   330602 non-null  int64 
 15  numberofpictures   330602 non-null  int64 
 16  postalcode         330602 non-null  object
 17  lastseen_month     330602 non-null  object
 18  lastseen_year      330602 non-null  int64 
dtypes: int64(8), object(11)
memory usage: 50.4+ MB

Train/Validate Split¶

In [232]:
features = data.drop('price', axis=1)
target = data['price']

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, 
    test_size=0.25, 
    random_state=12345
)

# Identify categorical columns
cat_cols = features_train.select_dtypes(include=['object','category']).columns
num_cols = features_train.select_dtypes(exclude=['object','category']).columns

features_train = features_train.copy()
features_valid = features_valid.copy()

features_train.loc[:, cat_cols] = features_train[cat_cols].astype(str)
features_valid.loc[:, cat_cols] = features_valid[cat_cols].astype(str)
In [233]:
def evaluate_model(name, model, features_train, target_train, features_valid, target_valid, cat_features=None):
    print(f"\nTraining {name}...")

    start_train = time.time()

    if cat_features is not None:
        model.fit(features_train, target_train, cat_features=cat_features)
    else:
        model.fit(features_train, target_train)

    train_time = time.time() - start_train

    start_pred = time.time()
    preds = model.predict(features_valid)
    pred_time = time.time() - start_pred
    
    rmse = mean_squared_error(target_valid, preds, squared=False)

    print(f"{name}: RMSE={rmse:.3f}, TrainTime={train_time:.2f}s, PredTime={pred_time:.4f}s")

    return {
        'Model': name,
        'RMSE': rmse,
        'Train_Time': train_time,
        'Predict_Time': pred_time
    }
In [234]:
gc.collect()
Out[234]:
0
In [235]:
ohe_processor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', dtype = int), cat_cols)
    ],
    remainder='passthrough'
)

lr_model = Pipeline([
    ('ohe', ohe_processor),
    ('lr', LinearRegression())
])

results = []
results.append(
    evaluate_model('Linear Regression Model', lr_model, features_train, target_train, features_valid, target_valid)
)
Training Linear Regression Model...
Linear Regression Model: RMSE=2864.931, TrainTime=1.21s, PredTime=0.3178s
In [236]:
gc.collect()
Out[236]:
52
In [237]:
# DecisionTree
dt_model = Pipeline([
    ('ohe', ohe_processor),
    ('dt', DecisionTreeRegressor(
        max_depth=20,
        min_samples_leaf=4,
        random_state=12345
    ))
])

results.append(
    evaluate_model('Decision Tree Model', dt_model, features_train, target_train, features_valid, target_valid)
)
Training Decision Tree Model...
Decision Tree Model: RMSE=1904.375, TrainTime=27.29s, PredTime=0.2165s
In [238]:
gc.collect()
Out[238]:
52
In [239]:
# Random Forest
rf_model = Pipeline([
    ('ohe', ohe_processor),
    ('rf', RandomForestRegressor(
        n_estimators=100,
        max_depth=20,
        random_state=12345,
        n_jobs=-1
    ))
])

results.append(
    evaluate_model('Random Forest', rf_model, features_train, target_train, features_valid, target_valid)
)
Training Random Forest...
Random Forest: RMSE=1686.247, TrainTime=1160.16s, PredTime=0.7960s
In [240]:
gc.collect()
Out[240]:
80
In [241]:
# results_df = pd.read_pickle('checkpoint_04b.pkl')
results_df = pd.DataFrame(results)
results_df.to_pickle('checkpoint_04a.pkl')
In [242]:
# CATBOOST
cat_features = [features_train.columns.get_loc(c) for c in cat_cols]

cat_model = CatBoostRegressor(
    depth=8,
    learning_rate=0.1,
    iterations=500,
    loss_function='RMSE',
    verbose=False,
    random_seed=12345
)

results.append(
    evaluate_model(
        'CatBoost',
        cat_model,
        features_train,
        target_train,
        features_valid,
        target_valid,
        cat_features=cat_features
    )
)
Training CatBoost...
CatBoost: RMSE=1636.975, TrainTime=243.67s, PredTime=0.6921s
In [243]:
cat_cols = list(cat_cols)

for col in cat_cols:
    features_train[col] = features_train[col].astype("category")
    features_valid[col] = features_valid[col].astype("category")
In [244]:
# XGBOOST

xgb_model = Pipeline(steps=[
    ('preprocess', ohe_processor),
    ('model', XGBRegressor(
        n_estimators=400,
        learning_rate=0.05,
        max_depth=8,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=12345,
        objective='reg:squarederror',
        n_jobs=-1
    ))
])

results.append(
    evaluate_model(
        "XGBoost", 
         xgb_model, 
        features_train, target_train, features_valid, target_valid)
)
Training XGBoost...
XGBoost: RMSE=1655.304, TrainTime=196.66s, PredTime=1.1021s
In [245]:
# LightGBM datasets
lgb_train = lgb.Dataset(
    features_train,
    label=target_train
)

lgb_valid = lgb.Dataset(
    features_valid,
    label=target_valid,
    reference=lgb_train
)


# LightGBM Set 1

params_set1 = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'verbose': -1
}


print("\nTraining LightGBM (Set 1)...")
start1 = time.time()
lgb_model1 = lgb.train(
    params_set1, 
    lgb_train, 
    valid_sets=[lgb_valid], 
    num_boost_round=300,
    callbacks=[lgb.early_stopping(stopping_rounds=50)]
)
train_time1 = time.time() - start1

start_pred1 = time.time()
preds1 = lgb_model1.predict(features_valid)
pred_time1 = time.time() - start_pred1
rmse1 = mean_squared_error(target_valid, preds1, squared=False)

results.append({
    'Model': 'LightGBM Set 1', 
    'RMSE': rmse1,
    'Boosting_Rounds': lgb_model1.best_iteration,
    'Train_Time': train_time1,
    'Predict_Time': pred_time1
})


# LightGBM Set 2

params_set2 = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 64,
    'learning_rate': 0.1,
    'verbose': -1
}

print("\nTraining LightGBM (Set 2)...")
start2 = time.time()
lgb_model2 = lgb.train(
    params_set2, 
    lgb_train, 
    valid_sets=[lgb_valid], 
    num_boost_round=500,
    callbacks=[lgb.early_stopping(stopping_rounds=50)]
)
train_time2 = time.time() - start2

start_pred2 = time.time()
preds2 = lgb_model2.predict(features_valid)
pred_time2 = time.time() - start_pred2
rmse2 = mean_squared_error(target_valid, preds2, squared=False)

results.append({
    'Model': 'LightGBM Set 2', 
    'RMSE': rmse2,
    'Boosting_Rounds': lgb_model2.best_iteration,
    'Train_Time': train_time2,
    'Predict_Time': pred_time2
})


print(f"LightGBM Set 1: RMSE={rmse1:.3f}, TrainTime={train_time1:.2f}, PredTime={pred_time1:.2f}")
print(f"LightGBM Set 2: RMSE={rmse2:.3f}TrainTime={train_time2:.2f}, PredTime={pred_time2:.2f}")
Training LightGBM (Set 1)...
/.venv/lib/python3.9/site-packages/lightgbm/basic.py:1780: UserWarning: Overriding the parameters from Reference Dataset.
  _log_warning('Overriding the parameters from Reference Dataset.')
/.venv/lib/python3.9/site-packages/lightgbm/basic.py:1513: UserWarning: categorical_column in param dict is overridden.
  _log_warning(f'{cat_alias} in param dict is overridden.')
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[300]	valid_0's rmse: 1684.32

Training LightGBM (Set 2)...
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[500]	valid_0's rmse: 1627.28
LightGBM Set 1: RMSE=1684.316, TrainTime=28.57, PredTime=1.42
LightGBM Set 2: RMSE=1627.283TrainTime=57.51, PredTime=4.37

Model analysis¶

In [246]:
# RESULTS TABLE

results_df = pd.DataFrame(results)
results_df.sort_values(by='RMSE', inplace=True)
results_df.reset_index(drop=True, inplace=True)

print("\n\nFINAL MODEL COMPARISON:")
print(results_df.to_string())

FINAL MODEL COMPARISON:
                     Model         RMSE   Train_Time  Predict_Time  Boosting_Rounds
0           LightGBM Set 2  1627.283377    57.508679      4.374746            500.0
1                 CatBoost  1636.975409   243.666996      0.692125              NaN
2                  XGBoost  1655.304347   196.659249      1.102127              NaN
3           LightGBM Set 1  1684.315965    28.574041      1.416330            300.0
4            Random Forest  1686.246985  1160.163886      0.796043              NaN
5      Decision Tree Model  1904.374673    27.287978      0.216526              NaN
6  Linear Regression Model  2864.931041     1.208937      0.317789              NaN

Final Conclusion¶

This project successfully developed and evaluated multiple machine learning models to predict used car prices for Rusty Bargain's mobile application. The analysis focused on three critical metrics: prediction quality (RMSE), prediction speed, and training time.

image.png

Key Findings¶

Best Overall Model: LightGBM Set 2

  • Achieved the lowest RMSE of 1,627.28 euros, representing the most accurate predictions
  • Demonstrated reasonable training time (approximately 58 seconds) and fast prediction speed (approximately 4 seconds)
  • Utilized 500 boosting rounds

Model Performance Summary:

  1. Top performers (RMSE < 1,700): LightGBM Set 2, CatBoost, and XGBoost all delivered strong predictive accuracy
  2. CatBoost offered the fastest prediction time (0.69 seconds) while maintaining excellent accuracy (1,636.98 RMSE)
  3. Random Forest provided competitive accuracy (1,686.25 RMSE) but required significantly longer training time (1,160 seconds)
  4. Linear Regression served as an effective sanity check with RMSE of 2,864.93, confirming that gradient boosting methods substantially outperformed the baseline

Trade-offs Analysis¶

For Production Deployment:

  • If prediction speed is critical: CatBoost is recommended with sub-second prediction time and only marginally lower accuracy than LightGBM
  • If accuracy is paramount: LightGBM Set 2 provides the best predictions while maintaining reasonable computational requirements
  • For balanced performance: XGBoost offers strong accuracy with moderate training and prediction times

Technical Approach¶

The project successfully:

  • Cleaned and preprocessed 330,000+ records with extensive missing value imputation using hierarchical grouping strategies
  • Implemented proper categorical encoding (label encoding for LightGBM/CatBoost, one-hot encoding for XGBoost)
  • Validated that gradient boosting methods significantly outperformed traditional algorithms

Recommendation¶

For Rusty Bargain's mobile application, I recommend deploying LightGBM Set 2 as the primary model, with CatBoost as a secondary option if real-time prediction speed becomes a bottleneck. Both models achieve RMSE under 1,650 euros, meaning predictions are typically within this margin of the actual price—acceptable accuracy for a used car valuation tool.

The gradient boosting approaches demonstrated clear superiority over simpler methods, justifying their computational overhead for this business application where prediction accuracy directly impacts customer trust and satisfaction.

Checklist¶

Type 'x' to check. Then press Shift+Enter.

  • Jupyter Notebook is open
  • Code is error free
  • The cells with the code have been arranged in order of execution
  • The data has been downloaded and prepared
  • The models have been trained
  • The analysis of speed and quality of the models has been performed
In [ ]: